
Local LLMs on Android in 2026: A field guide for product owners

A guide for CEOs and business owners weighing on-device AI on Android. It is based on production tests with real Pixel hardware, real Gemma 4 weights, and real Gemini cloud calls, with every call token-counted.

By Dmytro Samoilov

You've decided you want some piece of your AI stack to run on the user's phone instead of in the cloud. Maybe it was the cost projection at 10x scale. Maybe it was the EU compliance team.

This piece is a field guide for the person who has to decide whether to invest in on-device AI on Android, and what to ask the engineers when they come back with answers. It's based on my own production experiments — a real Pixel 8, real Gemma 4 weights, real Gemini 2.5 Flash calls, with usageMetadata token counts on every single one — and a few weeks of getting frustrated at the documentation.

Why the documentation lies to you

In the past 18 months Google has renamed the entire on-device AI surface twice and quietly retired the project that most tutorials still recommend.

In September 2024, TensorFlow Lite became LiteRT. Same code, new name. Most search results still say "TFLite."

In late 2025, MediaPipe Tasks GenAI, the path Google had recommended for running Gemma on Android, was retired in favor of LiteRT-LM. Different file format (.litertlm instead of .task), different library, same idea. If your engineers come back saying "we'll use MediaPipe LLM Inference," that's a stale answer from a 2024 tutorial.

Layered on top of this, "AI Edge SDK" is a marketing umbrella, not a library. When someone uses the term, ask which actual library they mean.

The result is that if you Google "how to run an LLM on Android" in 2026, the top results are wrong, half-wrong, or talking about deprecated tools. This is unusual; the Android ML space has historically been more stable. It's the on-device LLM corner specifically that's churning, because the hardware and the use cases are both moving fast.

I'm going to skip the archaeology and tell you what's actually true today.

The two paths that matter

By 2026, the choice for almost any new project on Android is between two paths.

Path 1 — LiteRT-LM with your own model. You ship a Gemma 4 (or Phi, or another compatible model) inside your app. You control the model, the prompts, the version. It runs on a wide swath of Android devices. You're responsible for updates and for the file size.

Path 2 — ML Kit GenAI with Google's Gemini Nano. You call a Google API. The model is downloaded and managed by an Android system service called AICore. You don't ship the model; you don't update it; you don't see the weights. In return, the device must be on Google's allow-list — which I'll describe in detail in the fragmentation section.

Everything else — the legacy paths, the third-party runtimes (llama.cpp, MLC LLM), the vendor SDKs (Qualcomm QNN) — exists for edge cases. If you are starting a new project today, your default choice is between these two paths, and the most common end state is using both for different features in the same app.

The first graph in this piece is the entire library ecosystem in one image. Look at it once and you have most of what your engineers will be discussing.

Graph 1 — The Android on-device LLM library ecosystem in 2026.

When to pick which path

A simple frame:

| Your situation | Path | Why |
| --- | --- | --- |
| Need it to work on most Android phones, not just flagships | LiteRT-LM | No vendor allow-list. Runs on a 2022 mid-range phone. |
| Want a polished feature with no model management | ML Kit GenAI | Google ships and updates the model. Twenty lines of Kotlin. |
| Want maximum quality on a narrow task (summary, ASR, proofread) | ML Kit GenAI | Gemini Nano outperforms Gemma 4 E2B on many of these. |
| Plan to share weights or behavior with iOS one day | LiteRT-LM | Your weights, your control. Portable. |
| Privacy-regulated data (PII, health, finance) | Either, on-device | The point is that data never leaves the phone. |
| Long context, complex reasoning, accurate vision OCR | Cloud (Gemini 2.5 Flash, Claude, GPT) | On-device is not there yet for these. |
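
If it helps to hand your engineers something concrete, the frame above can be written down as a small decision function. This is only a sketch: every name in it (`FeatureNeeds`, `recommendPath`, the flags) is hypothetical, and so are the priority order of the rules and the LiteRT-LM default. The privacy row is left out because either on-device path satisfies it.

```kotlin
// Hypothetical names throughout; this only encodes the rules of thumb from the table.
enum class Path { LITERT_LM, ML_KIT_GENAI, CLOUD }

data class FeatureNeeds(
    val needsLongContextOrHeavyReasoning: Boolean = false, // long context, complex reasoning, accurate vision OCR
    val mustRunOnMostPhones: Boolean = false,              // not just allow-listed flagships
    val plansToShareWithIos: Boolean = false,              // same weights or behavior on iOS later
    val narrowTaskMaxQuality: Boolean = false,             // summary, ASR, proofread
    val wantsNoModelManagement: Boolean = false,           // let Google ship and update the model
)

// Rule order is an assumption: capability limits first, then reach, then convenience.
fun recommendPath(needs: FeatureNeeds): Path = when {
    needs.needsLongContextOrHeavyReasoning -> Path.CLOUD
    needs.mustRunOnMostPhones || needs.plansToShareWithIos -> Path.LITERT_LM
    needs.narrowTaskMaxQuality || needs.wantsNoModelManagement -> Path.ML_KIT_GENAI
    else -> Path.LITERT_LM // assumed default when nothing in the table applies
}
```

Run it once per feature rather than once per app. A feature that only sets `wantsNoModelManagement` comes back as `ML_KIT_GENAI`, while one that also sets `mustRunOnMostPhones` comes back as `LITERT_LM`, which is how a single app ends up using both paths.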

What it actually costs (and what doesn't)

I tested three production scenarios on a Pixel 8 with 8 GB of RAM, running the same prompts on-device (Gemma 4 E2B and E4B via LiteRT-LM) and in the cloud (Gemini 2.5 Flash). You can find a detailed comparison with numbers in the companion case study: Hybrid AI cost savings on Android.

Otherwise, here are the bullet points to get a general idea:

The catch: device fragmentation

If everything I've said so far makes on-device AI sound straightforward, here's where it stops being straightforward.

Both paths have device-coverage problems, but they differ in shape.

LiteRT-LM runs on most Android devices, but how well depends on the silicon and the available delegates. A flagship Snapdragon will do hardware-accelerated NPU inference. A stock Pixel 8 falls back to CPU and feels it. A budget phone from 2022 might not have enough RAM to load Gemma 4 E4B at all.

ML Kit GenAI / Gemini Nano is a stricter story. AICore is an Android system service shipped as its own APK (com.google.android.aicore), and it's available only on devices Google and the OEM have approved. As of 2026 the allow-list looks like this:

Conspicuously absent in 2026: Oppo, Vivo, Sony, Asus, Honor, and most of the Chinese-market-only flagships. Not because their hardware can't do it — most ship with Snapdragon 8 Elite or MediaTek Dimensity 9500 silicon that is more than capable. They are simply not on Google's allow-list.

If your engineers say "let's use Gemini Nano," the immediate follow-up question is: what percentage of your users actually have it? In the EU and US, that number is dominated by Samsung. In India and Southeast Asia, Xiaomi, Oppo, and Vivo dominate the market, and Oppo and Vivo aren't on the list.
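
The coverage question is simple arithmetic once you have your own analytics. A minimal sketch, assuming you can export your user base by device brand: `nanoCoverage` and its parameters are hypothetical names, and matching by brand is a simplification, because real eligibility is decided per device model.

```kotlin
// Fraction of your users whose device brand is on the Gemini Nano allow-list.
// shareByBrand: brand -> share of YOUR user base (any consistent unit: percent, counts).
// allowListedBrands: the allow-list as you currently understand it (an input, not hard-coded).
fun nanoCoverage(shareByBrand: Map<String, Double>, allowListedBrands: Set<String>): Double {
    val total = shareByBrand.values.sum()
    require(total > 0.0) { "shareByBrand must contain at least one positive share" }
    // Sum only the brands that appear on the allow-list, then normalize by the total.
    val covered = shareByBrand.filterKeys { it in allowListedBrands }.values.sum()
    return covered / total
}
```

Feed it real numbers from your analytics, split by market. With made-up shares of 50, 30, and 20 for three brands where only the first is allow-listed, it returns 0.5.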

There's also the chipset layer underneath, which most decision-makers don't need to think about but which engineers will care about a great deal. A rough 2025–2026 map:

| Manufacturer | Typical 2025–2026 flagship chipset |
| --- | --- |
| Google Pixel | Tensor G4 / G5 (custom, fabbed by Samsung/TSMC) |
| Samsung Galaxy S | Snapdragon 8 Elite / Gen 5 in US and China; Exynos in some EU markets |
| OnePlus, Xiaomi (top), Asus ROG, Sony Xperia | Snapdragon 8 Elite Gen 5 |
| Oppo Find X9 / X9 Ultra | Snapdragon 8 Elite Gen 5 or MediaTek Dimensity 9500, depending on the SKU |
| Vivo X300 Ultra / X300 Pro | Snapdragon 8 Elite or MediaTek Dimensity 9500 |
| Honor (top) | Snapdragon 8 Elite |
| Mid-range and budget | Older Snapdragon, MediaTek Dimensity, occasionally Exynos |

The reason this matters: the on-device path you can pick depends not just on the brand but on the silicon underneath, because LiteRT-LM's fastest backend (Qualcomm QNN for NPU acceleration) is Qualcomm-only. A MediaTek Dimensity 9500 will run Gemma 4 perfectly well — just on CPU/GPU rather than NPU.
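
In code, that dependency usually ends up as an ordered fallback list. The sketch below is plain Kotlin with hypothetical names (`ChipVendor`, `Backend`, `backendPreference`). It does not call LiteRT-LM itself, and how you detect the vendor is left to you (on Android 12 and later, `Build.SOC_MANUFACTURER` is one possible source).

```kotlin
// Hypothetical types; they model the rule "NPU via QNN is Qualcomm-only, everyone else gets GPU or CPU".
enum class ChipVendor { QUALCOMM, MEDIATEK, GOOGLE_TENSOR, SAMSUNG_EXYNOS, OTHER }
enum class Backend { NPU_QNN, GPU, CPU }

// Returns the backends to try, fastest first. CPU is always the last resort.
fun backendPreference(vendor: ChipVendor, gpuDelegateAvailable: Boolean): List<Backend> = buildList {
    if (vendor == ChipVendor.QUALCOMM) add(Backend.NPU_QNN) // QNN acceleration exists only on Qualcomm silicon
    if (gpuDelegateAvailable) add(Backend.GPU)              // assumption: the caller has probed for a working GPU delegate
    add(Backend.CPU)
}
```

A Qualcomm flagship with a working GPU delegate gets `[NPU_QNN, GPU, CPU]`, a MediaTek device gets `[GPU, CPU]`, and a device with no usable accelerator gets `[CPU]` alone.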

So now picture the same library map I showed you above, but with these device tiers laid underneath, mapping which devices can actually use which path. That second graph is the one you'll show your engineering lead before kicking off the project.

Graph 2 — The same ecosystem, with device fragmentation overlaid.

Privacy is the silent killer feature

Most case studies I've seen on on-device AI lead with cost. Cost is real, but at the scales most product teams are operating at, privacy is the bigger story.

In my email-PII test, the hybrid pipeline didn't save any money. The cost of the polish step is the same whether the email was sanitized first or not. What it gives you is something different and more valuable: the cloud model never sees a real name, email address, phone number, or credit-card number. GDPR isn't going to relax in 2026 or 2027. The EU AI Act is sharpening, not softening. HIPAA isn't budging.
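
The shape of that hybrid pipeline is: sanitize on the phone, send only placeholders to the cloud, restore the real values locally. The sketch below uses regexes as a stand-in for the on-device model, so it is not the pipeline from the test. Regexes can catch emails, card numbers, and phone numbers, but not names, which is the part that needs a local model. All function and type names are hypothetical, and the patterns are deliberately loose and will over-redact.

```kotlin
// Result of the on-device step: redacted text plus a placeholder -> original map that never leaves the phone.
data class Sanitized(val text: String, val placeholders: Map<String, String>)

// Order matters: cards are redacted before phones so the phone pattern cannot swallow a card number.
val piiPatterns = listOf(
    "EMAIL" to Regex("""[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"""),
    "CARD" to Regex("""\b\d(?:[ -]?\d){12,15}\b"""),  // 13 to 16 digits, optional single separators
    "PHONE" to Regex("""\+?\d[\d ()-]{7,}\d"""),      // loose on purpose; may also hit dates or IDs
)

fun sanitize(input: String): Sanitized {
    val placeholders = LinkedHashMap<String, String>()
    var text = input
    for ((label, pattern) in piiPatterns) {
        var n = 0
        text = pattern.replace(text) { match ->
            val token = "[${label}_${++n}]"   // e.g. [EMAIL_1]
            placeholders[token] = match.value // remember the real value locally
            token
        }
    }
    return Sanitized(text, placeholders)
}

// After the cloud model returns polished text, put the real values back on-device.
fun restore(cloudOutput: String, placeholders: Map<String, String>): String =
    placeholders.entries.fold(cloudOutput) { acc, (token, original) -> acc.replace(token, original) }
```

For the input "Mail jane@example.com or call +1 415 555 0100, card 4111 1111 1111 1111." the cloud would only ever see "Mail [EMAIL_1] or call [PHONE_1], card [CARD_1]." The design choice worth keeping from this sketch is that the placeholder map stays in memory on the device and is never serialized into the request.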

Questions that can help you decide

  1. Which of your AI features are text-only and don't block the user while they wait? Voice memo cleanup, background summarization, intent classification, tagging, and auto-categorization are your easiest wins and your cleanest margins.
  2. Are you shipping in privacy-regulated markets? If yes, an on-device or hybrid pipeline is on the must-have list.
  3. What does your AI bill look like at 10x current scale? If the answer makes anyone uncomfortable, this is the conversation to have now. Moving features on-device takes 2–3 months of engineering and is the kind of work you want to start a quarter before you need it, not the week the bill becomes a problem.
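
The 10x question in item 3 is worth turning into a number before the meeting. A minimal sketch, assuming you log average input and output tokens per call (for Gemini, the usageMetadata on each response gives you these). The type and function names are hypothetical, and prices are parameters on purpose, so you plug in your provider's current rates rather than trusting anything hard-coded.

```kotlin
// Provider prices in USD per one million tokens. Fill in current rates yourself.
data class TokenPrices(val usdPerMillionInput: Double, val usdPerMillionOutput: Double)

// Observed usage for one AI feature, averaged from your own logs.
data class FeatureUsage(
    val callsPerMonth: Long,
    val avgInputTokens: Double,
    val avgOutputTokens: Double,
)

// Projected monthly cloud cost for the feature at a given traffic multiple (1.0 = today, 10.0 = the 10x scenario).
fun monthlyCostUsd(usage: FeatureUsage, prices: TokenPrices, scale: Double = 1.0): Double {
    val calls = usage.callsPerMonth * scale
    val inputCost = calls * usage.avgInputTokens / 1_000_000.0 * prices.usdPerMillionInput
    val outputCost = calls * usage.avgOutputTokens / 1_000_000.0 * prices.usdPerMillionOutput
    return inputCost + outputCost
}
```

With placeholder inputs of one million calls a month, 1,000 input and 500 output tokens per call, and prices of 1 and 2 USD per million tokens, this gives 2,000 USD today and 20,000 USD at `scale = 10.0`. The model is linear in traffic, so it will not capture volume discounts or rate-limit tiers.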

If your honest answers are "we're an interactive consumer app, all our AI is real-time, we're not in a regulated market, and we have no scale problem yet," then on-device probably isn't your fight this year. Run cloud, watch costs, revisit when one of those answers changes.

For everyone else: there is a real case to be made, and the tooling in 2026 is finally good enough that the case is no longer theoretical.

Partnership

If your team is hitting the cost wall on cloud AI, or you need on-device AI working in production on Android, that's what I do. AI cost-optimization audits, hybrid architecture design, and shipping support for mobile teams.
