AI That Lives on Your Device: The Challenges and Promise of On-Device Models

Every time you ask a cloud-based AI a question, your words travel to a data center, get processed by a model running on expensive GPUs, and the response travels back. It takes a few hundred milliseconds on a good connection. On a bad one, noticeably longer. And somewhere in a server rack, a copy of your question exists in a log.

On-device AI changes all of that. The model lives on your phone, your laptop, your browser. The data never leaves. The response is instant. And nobody else ever sees your query.

That promise is why on-device models are one of the most important frontiers in AI right now, and why they're central to what we're building at Firefox.

Close-up of a smartphone's internal circuitry, the hardware where on-device AI models are starting to run locally. Photo by solarseven on Pexels.

What's already here

The progress in the last year has been remarkable. Google's Gemini Nano, shipped in 1.8B and 3.25B parameter versions, runs natively on Pixel devices, powering features like call summarization and smart replies. Apple's on-device models power Live Voicemail transcription, and Apple Intelligence in iOS 18 builds Siri enhancements on the same foundation. Meta released Llama 3.2 in September 2024 with 1B and 3B parameter variants explicitly designed for edge deployment.

The small language model space is exploding. Microsoft's Phi-3 mini packs 3.8B parameters into a model that runs on mobile hardware. Google's Gemma 2 brings competitive performance at 2B parameters. Alibaba's Qwen2.5 offers small variants at 0.5B and 1.5B parameters. These models are surprisingly capable for their size, particularly at focused tasks like summarization, classification, and extraction.

The challenges are real

But let's be honest about what on-device AI can't do yet.

Memory and compute constraints. Running even a 3B parameter model on a phone requires several gigabytes of RAM and draws meaningfully on the battery. Quantization helps (INT4 cuts weight memory to a quarter of FP16), but there's a quality tradeoff. Inference speeds on consumer hardware are still measured in single-digit tokens per second for larger models, far slower than cloud-based alternatives.
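The memory arithmetic is easy to sketch: weight memory is roughly parameter count times bytes per parameter, with runtime overhead like the KV cache coming on top. A back-of-the-envelope helper (the function name is my own illustration, not any library's API):

```typescript
// Rough memory estimate for model weights alone:
// parameters × bits per parameter, converted to gigabytes.
// Real runtimes also need space for the KV cache and activations.
function weightMemoryGB(paramCount: number, bitsPerParam: number): number {
  return (paramCount * bitsPerParam) / 8 / 1e9;
}

const threeB = 3e9;
const fp16 = weightMemoryGB(threeB, 16); // ≈ 6.0 GB at FP16
const int4 = weightMemoryGB(threeB, 4);  // ≈ 1.5 GB at INT4

console.log(`FP16: ${fp16} GB, INT4: ${int4} GB`);
```

The 4x drop from FP16 to INT4 is what makes a 3B model plausible on a phone with 8 GB of RAM at all; the quality tradeoff comes from representing each weight with only 16 possible values.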

Context windows are limited. Cloud models now handle 100K+ tokens of context. On-device models, even when their architectures nominally support more, are often limited in practice to somewhere between 512 and a few thousand tokens by memory and speed. That's enough for a short summary or a quick classification, but not enough for complex reasoning over long documents.
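One common workaround is map-reduce summarization: split the document into pieces that fit the small window, summarize each piece on-device, then summarize the concatenated summaries. A minimal chunking sketch, assuming the rough heuristic of about four characters per token (the helper name is illustrative):

```typescript
// Split text into pieces that fit a small context window, using the
// rough heuristic of ~4 characters per token. Real tokenizers vary,
// so production code would count actual tokens instead.
function chunkForContext(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Each chunk can then be summarized locally, and the partial summaries
// concatenated and summarized once more (the "reduce" step).
```

This trades fidelity for feasibility: details that span chunk boundaries can get lost, which is part of why complex long-document reasoning still tends to go to the cloud.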

Model staleness. A cloud model can be updated continuously. An on-device model is frozen at the moment it was downloaded. If the world changes (new events, new products, updated information), the model doesn't know.

Ecosystem fragmentation. Unlike the cloud, where you pick an API and go, on-device deployment means contending with dozens of hardware configurations, NPU architectures, and OS-level constraints. What runs smoothly on a Pixel 8 Pro might not run at all on a mid-range Android device from two years ago.

The opportunity is bigger than the challenges

Despite all of this, I'm bullish on on-device AI for one simple reason: privacy and latency aren't features. They're prerequisites for the kinds of AI experiences people will actually trust with their most sensitive tasks.

Think about what an AI assistant in a browser needs to handle: your search history, your financial information, your medical questions, your personal messages. Would you send all of that to a cloud server owned by a company whose business model is advertising? For many people, the answer is no, and that "no" is a permanent constraint that cloud-only AI can't design around.

On-device models solve this at the architecture level. The data never leaves. There's nothing to leak, nothing to subpoena, nothing to monetize. For a privacy-first organization like Mozilla, this isn't a nice-to-have. It's the foundation.

The latency advantage matters too. Sub-100ms response times on flagship devices with NPU acceleration enable experiences that feel instantaneous: real-time suggestions as you type, immediate classification of content, instant summarization of the paragraph you're reading. These micro-interactions are where AI goes from "useful tool" to "invisible layer of intelligence."

The hybrid future

The realistic near-term architecture isn't purely on-device or purely cloud. It's hybrid. Simple, latency-sensitive tasks (summarization, classification, autocomplete, content warnings) run on-device. Complex, knowledge-intensive tasks (research synthesis, long-context reasoning, generation of novel content) call the cloud when the user opts in.

The art is in the routing: knowing which tasks can be handled locally and which require more horsepower, and giving the user transparent control over when their data leaves the device.
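One way to picture that routing: a small dispatcher that keeps latency-sensitive tasks on-device and sends heavier ones to the cloud only with explicit opt-in. A sketch with illustrative task names, not Firefox's actual implementation:

```typescript
// Tasks cheap enough to run on-device vs. tasks that need cloud horsepower.
// The task list and routing policy here are illustrative assumptions.
type Task =
  | "summarize" | "classify" | "autocomplete"   // latency-sensitive, local
  | "research" | "longContextReasoning";        // knowledge-intensive, cloud

const LOCAL_TASKS: ReadonlySet<Task> = new Set<Task>([
  "summarize", "classify", "autocomplete",
]);

// Route a task: local tasks never leave the device; everything else
// goes to the cloud only if the user has explicitly opted in.
function route(task: Task, cloudOptIn: boolean): "device" | "cloud" | "unavailable" {
  if (LOCAL_TASKS.has(task)) return "device";
  return cloudOptIn ? "cloud" : "unavailable";
}
```

The important design choice is the "unavailable" branch: without opt-in, a heavy task simply doesn't run, rather than silently falling back to sending data off-device.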

Why this matters for the browser

At Firefox, on-device AI is core to our approach. A browser that can summarize a page, suggest tab groups, translate text, or flag suspicious content, all without sending your data anywhere, is a fundamentally different product from one that phones home for every AI interaction.

The models are small. The constraints are real. But the trajectory is clear: on-device AI is getting more capable every quarter, and the use cases that matter most, the private, personal, trust-dependent ones, are exactly the ones it's best suited for.

The AI that lives on your device isn't a compromise. It's the future.
