On-Device AI Agents in 2026: Capabilities and Limits

Google's FunctionGemma — a 270-million-parameter model running on a phone — can translate "Create a calendar event for 2:30 PM tomorrow" directly into a system command without touching a server. Apple Intelligence's Image Wand turns a sketch into a finished image inside the Notes app, entirely on-device. Microsoft Copilot's Recall semantically searches your past activity, locally. The on-device AI agent isn't an idea anymore — it's a category.

Quick Answer: On-device AI agents are AI systems that run multi-step planning, autonomous actions, and semantic search directly on a smartphone, laptop, or wearable — no cloud round-trip. The 2026 stack: Apple Intelligence (iPhone, iPad, Mac, Vision Pro), Google's Gemma 4 with FunctionGemma (270M parameters) and AI Edge Gallery, Microsoft's Copilot with Recall. Capabilities: live translation, function calling, sketch-to-image, semantic memory search. Limits: 8 GB RAM is still typical on flagship phones, blocking models above roughly 7B parameters; complex multi-document reasoning still belongs to cloud LLMs.

An on-device AI agent is an artificial intelligence system that runs directly on a user's smartphone, laptop, wearable, or other edge device — performing multi-step planning, autonomous actions, and tool-calling without sending data to cloud servers. Unlike traditional voice assistants that respond to a single command at a time, agents can chain multiple steps to complete complex tasks.

[Figure: isometric smartphone with an on-device AI agent performing multiple tasks (planning, calendar, summarisation), all locally]

What "Agent" Means in 2026 — and Why It's Different from a Voice Assistant

The word "agent" has been overloaded for years. The 2026 version means something specific. An agent:

Plans across multiple steps rather than executing a single instruction
Calls tools — opening apps, modifying calendar entries, retrieving documents — autonomously based on the plan
Holds context across the chain of actions so each step builds on the previous one
Decides when it's done instead of waiting for the next user prompt

A 2010 voice assistant did one thing per turn ("set a 5-minute timer", "what's the weather"). A 2026 on-device agent can take a prompt like "find my last email about Q3 budget, summarise the key concerns, and draft a 3-point reply" and execute all three steps without further input.
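The plan, call, carry-context, stop cycle described above can be sketched in a few lines. This is a minimal illustration, not how any shipping agent is implemented: the three tools are hypothetical stand-ins for real app integrations, and the plan is hard-coded here, whereas a real agent would have the model produce it.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tools: toy stand-ins for real mail and drafting integrations.
def find_email(query: str) -> str:
    return f"Email about {query}: costs are up, hiring is frozen, travel is cut."

def summarise(text: str) -> str:
    # Toy "summary": keep whatever follows the first colon.
    return text.split(":", 1)[1].strip()

def draft_reply(summary: str) -> str:
    points = [p.strip(" .") for p in summary.split(",")]
    return "\n".join(f"{i}. Re: {p}" for i, p in enumerate(points, start=1))

TOOLS: dict[str, Callable[[str], str]] = {
    "find_email": find_email,
    "summarise": summarise,
    "draft_reply": draft_reply,
}

@dataclass
class AgentState:
    plan: list[str]                 # remaining tool names, in order
    context: str                    # output of the previous step
    history: list[tuple[str, str]] = field(default_factory=list)

def run_agent(plan: list[str], initial_input: str) -> AgentState:
    state = AgentState(plan=list(plan), context=initial_input)
    while state.plan:               # the loop ends when the plan is exhausted
        tool_name = state.plan.pop(0)
        result = TOOLS[tool_name](state.context)   # call the tool
        state.history.append((tool_name, result))
        state.context = result      # each step builds on the previous one
    return state

state = run_agent(["find_email", "summarise", "draft_reply"], "Q3 budget")
print(state.context)
```

The point of the sketch is the loop shape: one user prompt, several tool calls, and no further input between them.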

The under-appreciated change is that this is happening on-device. Google's FunctionGemma — the model behind the Mobile Actions demo in Google AI Edge Gallery — consists of only 270 million parameters yet effectively translates natural language into function calls directly on mobile devices. That's a 270M-parameter model doing real tool-calling, locally, in your pocket.

Industry watchers describe the broader shift as moving from "passive assistants to active agents that orchestrate entire workflows" — exemplified by supply-chain agents communicating with compliance agents and financial forecasting agents working autonomously. On-device is where that pattern lands on consumer hardware.

What On-Device AI Agents Can Do Today

The 2026 capability list is concrete:

Multi-step planning and autonomous actions. Google's Gemma 4, launched April 2026, enables developers to create on-device agents that run multi-step planning, autonomous actions, and offline code generation without specialised fine-tuning. Gemma 4 also augments its own knowledge by querying external sources like Wikipedia (when available), responding to questions beyond its training data.

Function calling on the device. Google's Mobile Actions demo translates natural-language commands like "Show me the San Francisco airport on map" or "Create a calendar event for 2:30 PM tomorrow for cooking class" into system commands — entirely on-device, with no server dependency. The Tiny Garden demo within Google AI Edge Gallery is an interactive game where voice commands manage a virtual garden, showing that the FunctionGemma model operates without server dependence.
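To make the idea of function calling concrete, here is a rough sketch of the step that follows the model: turning a structured call into an actual handler invocation. The JSON shape, the handler names, and the registry are all assumptions made for illustration; they are not FunctionGemma's actual output format or any Android API.

```python
import json
from typing import Callable

# Hypothetical handlers standing in for real system actions.
def create_calendar_event(title: str, time: str, day: str) -> str:
    return f"event '{title}' created for {day} at {time}"

def show_on_map(query: str) -> str:
    return f"map opened at '{query}'"

# Only functions listed here can ever be invoked by model output.
REGISTRY: dict[str, Callable[..., str]] = {
    "create_calendar_event": create_calendar_event,
    "show_on_map": show_on_map,
}

def dispatch(model_output: str) -> str:
    """Parse an assumed JSON function call and run the matching handler."""
    call = json.loads(model_output)
    name = call.get("name")
    if name not in REGISTRY:
        raise ValueError(f"unknown function: {name!r}")
    return REGISTRY[name](**call.get("arguments", {}))

# What a model might emit for "Create a calendar event for 2:30 PM tomorrow
# for cooking class" (assumed format).
model_output = json.dumps({
    "name": "create_calendar_event",
    "arguments": {"title": "cooking class", "time": "14:30", "day": "tomorrow"},
})
print(dispatch(model_output))
```

Keeping an explicit registry means the model can only trigger actions the app has chosen to expose, which matters when calls run without user confirmation.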

Live translation across messages and calls. Apple Intelligence's Live Translation automatically translates messages in real time, displays live translated captions in FaceTime, and provides spoken translations for phone calls — entirely on-device. Live Translation works in Messages, FaceTime, and Phone, with eight more languages added by end of 2025 (Danish, Dutch, Norwegian, Portuguese (Portugal), Swedish, Turkish, Chinese (Traditional), Vietnamese). For the full breakdown, see our offline language learning article.

Writing assistance across third-party apps. Apple's Writing Tools can proofread text, rewrite different versions until the desired tone is achieved, and summarise selected text with a tap — available in nearly all writing applications, including third-party apps.

Sketch-to-image generation. Image Wand in the Notes app transforms a rough sketch into a related image: circle the sketch, and the model creates a complementary visual based on the surrounding context. All on-device.

Semantic search across personal history. Microsoft Copilot's Recall, introduced by Satya Nadella, allows users to search their history semantically — scrolling back to retrieve documents, applications, and more from their past, by meaning rather than keyword.
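Searching "by meaning rather than keyword" usually comes down to comparing embedding vectors. The sketch below shows that general technique with cosine similarity; it is not a description of how Recall works internally. The three-dimensional vectors and the item names are invented for illustration, and a real system would get its vectors from a local embedding model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical embeddings; axes loosely mean (finance, travel, cooking).
HISTORY: dict[str, list[float]] = {
    "Q3_budget_v2.xlsx": [0.9, 0.1, 0.0],
    "Lisbon flight confirmation": [0.1, 0.9, 0.0],
    "Sourdough recipe tab": [0.0, 0.1, 0.9],
}

def search(query_vec: list[float], top_k: int = 1) -> list[str]:
    """Return the top_k history items closest in meaning to the query."""
    ranked = sorted(HISTORY, key=lambda name: cosine(query_vec, HISTORY[name]),
                    reverse=True)
    return ranked[:top_k]

# "that spreadsheet about spending" shares no keyword with the filename,
# but is assumed to embed near the finance axis.
print(search([0.8, 0.2, 0.1]))
```

The query never has to contain the word "budget" or the filename; it only has to land near the right item in vector space.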

On-device transcription and live captioning. Google Meet has deployed an Ultra-HD segmentation model that is 25 times larger than previous versions, running it on mobile NPUs without sacrificing inference speed during typical 20–30 minute sessions. The Argmax Pro SDK saw more than a 2× speedup after moving from GPU to NPU, which enables reliable on-device live transcription for extended periods with minimal battery impact.

Real-time character animation. Epic Games' Live Link Face app for Android uses on-device NPU acceleration to enable real-time MetaHuman facial animation at up to 30 frames per second, streaming directly into Unreal Engine.

That's the capability ceiling in 2026. Not long ago, most of these would have required a server round-trip.

The Major Platforms: Apple, Google, Microsoft, Meta

Four strategies are competing for the on-device agent layer.

Apple Intelligence — Privacy as positioning

Apple Intelligence is built into iPhone, iPad, Mac, and Apple Vision Pro, helping users write, communicate, and manage tasks while maintaining privacy at every step. Apple has described its on-device approach as the cornerstone for security and privacy, stating that data residing only on user devices reduces the risk of centralised attacks.

Apple's monetisation model is indirect: rather than offering a standalone chatbot subscription, the company uses Apple Intelligence features to encourage users to upgrade from older iPhones that can't support the latest on-device models. Apple Intelligence has not yet rolled out in mainland China at full feature parity, where regulatory approval and partnership development with local AI providers remain challenges.

For tasks that exceed the on-device model, Apple introduced Private Cloud Compute — a cloud intelligence system that combines powerful generative models with the privacy and security guarantees of on-device processing. End-to-end encryption ensures that personal user data sent to PCC remains accessible only to the user, preventing unauthorised access including from Apple itself.

Google Gemma 4 and AI Edge Gallery — Open models on the edge

Google's approach is to ship open models and a developer ecosystem. Gemma 4 enables multi-step planning, autonomous actions, and offline code generation without specialised fine-tuning. LiteRT-LM — the framework optimising Gemma 4 — allows the model to run with a memory footprint of less than 1.5 GB on some devices, including IoT and edge devices like the Raspberry Pi 5.

The Google AI Edge Gallery (now available on iOS as well as Android) lets users experiment with multi-turn chat and local transcription, leveraging Google's on-device performance and privacy. The Gallery now features NPU support for select models, allowing developers to test and validate NPU acceleration on mobile devices.

Microsoft Copilot + Recall — The PC angle

Microsoft's bet centres on the AI PC. On-device Copilot — including Recall — enables semantic search of user history, with multimodal understanding and local data processing as core capabilities. Recall flips the search paradigm: instead of remembering filenames or exact phrases, you describe what you were doing and the local model retrieves it.

Meta Llama - Open weights, broad deployment

Meta's strategy emphasises open-source models that can be deployed on-device, while maintaining cloud-scale capability for more demanding tasks. This contrasts with Apple's device-first privacy approach: Llama gives developers more deployment flexibility but doesn't make the same end-to-end privacy guarantees.

Hardware: NPUs, RAM, and What Actually Constrains Performance

[Figure: isometric diagram of on-device AI agent hardware constraints (RAM, NPU, and model size limits)]

The 2026 hardware story is split into "what's improved" and "what's still binding."

What's improved: NPUs

Modern Neural Processing Units make once-impossible workloads feasible. Google Meet deployed a segmentation model 25× larger than its predecessor by routing it through mobile NPUs without sacrificing speed. The Argmax Pro SDK saw more than a 2× speedup after moving on-device speech recognition from GPU to NPU, enabling reliable live transcription for extended periods with minimal battery impact.

The new Wi-Fi infrastructure is keeping pace. Qualcomm's FastConnect 8800 Mobile Connectivity System, introduced March 2026, is the first mobile solution with a 4×4 Wi-Fi radio configuration, enabling speeds beyond 10 Gbps. The same chip features Proximity AI technology — UWB plus Bluetooth Channel Sounding plus Wi-Fi Ranging — enabling centimetre-accurate cross-device tracking. For broader NPU context, see our Neural Processing Unit explainer.

What's still binding: RAM

Compute isn't the bottleneck anymore — RAM is. Reality check from 2026 hardware data:

The most common laptop configuration is 16 GB of RAM; 8 GB remains very typical
Apple ships the iPhone 16e and 17 with 8 GB of RAM
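A 7-billion-parameter AI model requires approximately 5 GB of RAM to run, a figure that assumes quantised weights (at 16-bit precision the weights alone would take about 14 GB)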
Effective on-device agents need a minimum of 32,000 tokens of context, which requires significantly more RAM than typically available

Bottlenecks in RAM supply, rising costs, and the prioritisation of more expensive data centre RAM over consumer RAM mean more capable consumer devices are unlikely to improve significantly in the near future. The on-device agent capability ceiling in 2026 is determined less by chip speed and more by memory headroom — and that headroom is not climbing as fast as the model ecosystem would benefit from.
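The memory arithmetic behind these figures is simple enough to work through. The sketch below estimates weight memory from parameter count and precision, and key/value cache memory from context length. The architecture numbers (32 layers, 32 KV heads, head dimension 128) are an assumed, illustrative 7B-class shape rather than the specification of any model named in this article, and real runtimes add overhead on top.

```python
def weights_gb(params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Memory for the attention key/value cache, in decimal gigabytes."""
    # Factor 2: one key vector and one value vector per layer, per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9

# Assumed shape for illustration: 32 layers, 128-dim heads, 16-bit cache.
print(f"7B weights at 16-bit:         {weights_gb(7e9, 16):.1f} GB")
print(f"7B weights at 4-bit:          {weights_gb(7e9, 4):.1f} GB")
print(f"32K-token cache, 32 KV heads: {kv_cache_gb(32_000, 32, 32, 128):.1f} GB")
print(f"32K-token cache, 8 KV heads:  {kv_cache_gb(32_000, 32, 8, 128):.1f} GB")
```

Under these assumptions the weights shrink from 14.0 GB to 3.5 GB with 4-bit quantisation, which is in line with the roughly 5 GB figure once runtime overhead is included. The 32,000-token cache is the harder part: about 16.8 GB with one KV head per attention head, and still about 4.2 GB with an assumed 4× reduction in KV heads. Either way, context is what pushes past an 8 GB device.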

Mobile agent scaling research

Academic work on scaling mobile agent systems frames the constraint formally: mobile devices are fundamentally constrained by limited computation, memory, and energy budgets, making it impractical to deploy high-capacity models that often exceed 100 billion parameters. The proposed solution is two-dimensional scaling — improving individual agent capability through compression techniques (pruning, quantization, distillation) and enabling collective intelligence through multi-agent collaboration across devices.
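Of the compression techniques listed, quantization is the easiest to show in a few lines. The following is a minimal sketch of symmetric per-tensor int8 quantization, one common variant; it is offered as an illustration of the idea, not as the method used by the research described above or by any particular runtime.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.50, -1.27, 0.03, 0.90]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("worst-case rounding error:", max_error)
```

Each weight now needs one byte instead of two or four, at the cost of a rounding error bounded by half the scale. Production schemes typically use finer-grained scales (per channel or per block) and lower bit widths, but the trade is the same: less memory for slightly less precision.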

When On-Device Wins, When Cloud Still Does

The choice between on-device and cloud agents is not binary — it depends on the workload.

On-device wins on:

Privacy. Data never leaves the device, eliminating the centralised-attack surface and aligning structurally with GDPR's "privacy by design" principle. For more on the GDPR side, see our data minimization article.
Latency. No network round-trip means responses are not held up by connection delays, which is critical for live translation, real-time captioning, and interactive games.
Offline reliability. Agents continue working without connectivity — useful for travel, remote areas, and any context where Wi-Fi is unreliable or unavailable.
Cost. No per-token API fees, no subscription required for the underlying model, no recurring cloud compute/storage costs.
Multimodal local data. Recall, Image Wand, and similar features depend on access to a user's local data that the cloud genuinely shouldn't see.

Cloud still wins on:

Model capability for frontier tasks. Vast cloud compute supports models orders of magnitude larger than what fits in 8 GB of phone RAM. For complex multi-document reasoning, deep research, and novel-quality generation, cloud LLMs still have a meaningful edge.
Fresh information. Cloud agents can integrate live web data, current news, and real-time updates. On-device models work only with what they've been trained on, plus what's stored locally.
Collaboration features. Real-time co-editing, shared workspaces, and multi-user agent workflows are easier to coordinate centrally.
Centralised compliance and audit. Enterprises that need centralised logging, compliance auditing, and policy enforcement across thousands of devices often find this easier in the cloud model.

The honest framing is that on-device and cloud are complementary tiers in a single workflow — Apple's Private Cloud Compute architecture is one explicit example of this hybrid pattern.
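One way to picture the hybrid pattern is as a routing decision made per task. The sketch below is a hypothetical policy built from the trade-offs listed above; the `Task` fields, the 32,000-token local limit, and the rule ordering are assumptions for illustration and do not describe Apple's Private Cloud Compute or any vendor's actual router.

```python
from dataclasses import dataclass

# Assumed local context budget, for illustration only.
LOCAL_CONTEXT_LIMIT = 32_000

@dataclass
class Task:
    prompt_tokens: int
    needs_live_web: bool = False
    contains_personal_data: bool = False

def route(task: Task, online: bool) -> str:
    """Pick a tier for one task: 'on-device' or 'cloud'."""
    if not online:
        # No connectivity: local is the only option, even if degraded.
        return "on-device"
    if task.contains_personal_data:
        # Privacy first: personal data stays local in this policy.
        return "on-device"
    if task.needs_live_web or task.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        # Fresh information and long documents favour the cloud tier.
        return "cloud"
    return "on-device"

print(route(Task(prompt_tokens=500), online=True))
print(route(Task(prompt_tokens=500, needs_live_web=True), online=True))
print(route(Task(prompt_tokens=120_000), online=True))
```

A real policy would likely add a third tier for private cloud processing and would weigh battery and cost as well, but the ordering of the checks is the design choice that matters: what the system refuses to send comes before what it would prefer to offload.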

What On-Device Agents Still Can't Do

The capability gaps in 2026 are specific and worth naming clearly:

Frontier-grade reasoning. Complex multi-step reasoning, sophisticated mathematical work, and PhD-level domain analysis still benefit substantially from larger cloud models. A 270M-parameter FunctionGemma is brilliant at translating speech to function calls; it's not going to write a research paper.

Long-context document work. Most on-device models have smaller context windows than their cloud counterparts. For 100-page document analysis, multi-source synthesis, or codebases as input, cloud APIs remain the practical choice.

Real-time web information. On-device agents can't pull current events, latest research, or live data feeds without network access. For research-grounded tasks, cloud RAG over web sources is more capable.

Cross-device coordination at scale. Mobile agent systems struggle with collective intelligence across many devices in practice. Academic frameworks for two-dimensional scaling (compression plus multi-agent collaboration) exist, but production-quality multi-device agent ecosystems are still emerging.

Deployment expertise barrier. On-device AI requires specialised expertise in edge computing for successful deployment and optimisation. For organisations without this expertise, cloud APIs offer a lower-friction starting point even when on-device would be technically superior.

Update friction. Whether the AI model is static or requires periodic updates is a key consideration; over-the-air updates are non-trivial to manage across diverse device fleets. Cloud models update silently; on-device models need explicit, sometimes user-mediated updates.

Frequently Asked Questions

What is an on-device AI agent?

An on-device AI agent is an AI system that runs entirely on a user's phone, laptop, or wearable — performing multi-step planning, autonomous actions, and tool-calling without sending data to a cloud server. Unlike traditional voice assistants that respond to one command at a time, agents can chain multiple steps to complete more complex tasks.

How are on-device AI agents different from voice assistants like Siri or Alexa?

Traditional voice assistants handle single commands or queries. On-device AI agents can plan and execute multi-step workflows autonomously, for example "summarise my email, draft a reply, and add a calendar event". Modern agents like Google's FunctionGemma (270M parameters) translate natural language directly into system function calls on-device.

Which devices support on-device AI agents in 2026?

Apple Intelligence runs on iPhone, iPad, Mac, and Apple Vision Pro. Google's Gemma 4 and AI Edge Gallery run on Android and iOS. Microsoft Copilot on Windows includes Recall for semantic history search. Most flagship phones from 2024 onwards have NPUs capable of running on-device agent models, though memory still limits model size: the iPhone 16e and 17 ship with 8 GB of RAM.

What can on-device AI agents do today?

In 2026, on-device agents handle live translation, multi-turn conversation, function calling (creating calendar events, opening maps), sketch-to-image generation, semantic search across personal history, summarisation, and offline code generation. Google's Gemma 4 supports multi-step planning and tool use without specialised fine-tuning; Apple Intelligence's Writing Tools work in nearly all third-party apps.

What are the limitations of on-device AI agents?

Three main limits. First, RAM is the binding constraint: most phones ship with 8 GB, while a 7B-parameter model needs around 5 GB. Second, the resulting cap on model size restricts capability for complex reasoning, multi-document work, and frontier-grade outputs. Third, on-device models can't access real-time web information without a network connection. For workflows needing maximum capability or fresh data, cloud AI still wins.

Conclusion

The 2026 on-device AI agent is no longer aspirational. Google's FunctionGemma demonstrates that a 270-million-parameter model can do real natural-language-to-function-call translation. Gemma 4 supports multi-step planning, autonomous actions, and offline code generation. Apple Intelligence ships Live Translation, Writing Tools, Image Wand, and a deeply integrated agent layer across iPhone, iPad, Mac, and Vision Pro. Microsoft Copilot's Recall delivers semantic search over local history. The category is here.

The honest framing of the limits matters just as much. RAM is the binding constraint — 8 GB is still typical, while a 7B model needs around 5 GB, and effective agent context windows want 32K+ tokens. Frontier reasoning, long-context document work, real-time web information, and cross-device coordination at scale all still favour cloud agents. The hybrid pattern (Apple's Private Cloud Compute is the most explicit example) is the realistic architecture for 2026 and 2027.

For developers, the question is no longer whether on-device agents are viable — it's which subset of your workflow genuinely benefits from the privacy, latency, and offline properties of running locally, and which subset still belongs in the cloud. For users, the question is simpler: pick the platform whose privacy and capability mix matches your priorities. That ecosystem exists now.
