Case study · On-device AI · Android

How a hybrid on-device + cloud architecture cuts AI costs by ~76% in production Android apps

Real numbers from three production scenarios on a Pixel 8 — what actually saves money, and what doesn't.

By Dmytro Samoilov

~$57,650 / year saved at 100K monthly active users

The problem

Most Android teams shipping AI features today are running everything through cloud APIs. Gemini, OpenAI, whatever. It works. The user gets the feature.

But here's the thing: the bill scales with your user count, not your revenue. Once you're past 100K monthly active users on a feature that uses AI on every interaction, the API line item stops looking like infrastructure and starts looking like a problem.

So I ran a real test. Three production scenarios. Same Pixel 8 device. Same prompts in both paths. Compared fully on-device (Gemma E2B / E4B) against fully cloud (Gemini 2.5 Flash). Token-counted everything. Calculated exact spend.

Here's what I found.

The setup

Three features that show up in a lot of mobile apps: voice memo cleanup, receipt parsing, and email drafting with PII.

Each test used the same prompt for both the local and cloud paths. Token counts come from usageMetadata for cloud, and from a character-based estimate (~4 chars/token) for local, since LiteRT-LM doesn't expose per-call counts on Android.
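The character-based estimate can be sketched in a few lines. This is a minimal illustration, not the measurement code behind the article: the function names are hypothetical, and the 4 chars/token ratio is only a rough average that shifts with language and content.

```python
import math

# Rough ratio quoted in the text; real tokenizers vary by language and content.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length, rounding up."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_call_tokens(prompt: str, response: str) -> dict:
    """Estimate input, output, and total tokens for one local call."""
    input_tokens = estimate_tokens(prompt)
    output_tokens = estimate_tokens(response)
    return {
        "input": input_tokens,
        "output": output_tokens,
        "total": input_tokens + output_tokens,
    }
```

Rounding up keeps the estimate slightly conservative, which matters little here because local calls carry no API cost.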

Results

Voice memo cleanup

Both produced essentially the same cleaned text. Local kept a couple of "Okay. So" fillers that the cloud removed, which better prompting on my side should fix. Otherwise, the quality is comparable.

The cloud cost is dominated by audio tokens. Audio input bills at $1.00 per million tokens on Gemini 2.5 Flash, over 3x the text rate. A 27-second clip comes to 864 audio tokens at $1/M, plus the prompt at the text rate.
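As a worked example, the audio part of that bill can be reproduced directly. The 32 tokens/second figure is simply what the numbers above imply (864 tokens for 27 seconds). The function name is hypothetical, and the text prompt and output tokens are left out.

```python
# Both constants come from the figures quoted above.
AUDIO_TOKENS_PER_SECOND = 32   # implied by 864 tokens / 27 seconds
AUDIO_USD_PER_M_TOKENS = 1.00  # audio input rate quoted for Gemini 2.5 Flash


def audio_input_cost(seconds: float) -> tuple[int, float]:
    """Return (audio_tokens, usd) for the audio input of one clip.

    Covers audio input only; prompt and output tokens are billed separately.
    """
    tokens = round(seconds * AUDIO_TOKENS_PER_SECOND)
    usd = tokens * AUDIO_USD_PER_M_TOKENS / 1_000_000
    return tokens, usd


tokens, usd = audio_input_cost(27)
print(f"{tokens} audio tokens, ${usd:.6f} per clip")
```

Under these assumptions a single clip costs well under a tenth of a cent, so the cost only becomes visible once it is multiplied by daily usage across a large user base.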

Receipt parsing

This is where local still struggles. Gemma E4B got the vendor name wrong ("CROWNE HOTEL" instead of "CROWNE PLAZA"), got the date wrong, slightly misread a line item. Cloud got it right.

If your business depends on accurate receipt data, local isn't ready for this case yet. Roughly 11x slower with worse quality is a hard sell. My guess is 6–12 months until on-device vision catches up.

Email drafting with PII

Pure cost analysis here is a trap. The hybrid pipeline doesn't save you money in this case: the cloud polish step costs about the same whether the email was sanitized first or not.

What it gives you is something different: data minimization by design, which is a big step toward GDPR compliance. The local sanitize step replaces PII with placeholders. The cloud polishes the placeholdered version. The local side then restores the original values. The cloud model never sees a real name, email, phone number, or card.
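The sanitize, polish, rehydrate flow can be sketched as follows. This is an illustration under loud assumptions: regexes stand in for the local sanitize step and only catch emails and phone numbers (names and card numbers need more than a regex), `fake_cloud_polish` is a stub rather than a real API call, and every name here is hypothetical.

```python
import re

# Simplified patterns for illustration only; they will miss many real formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sanitize(text: str) -> tuple[str, dict]:
    """Replace PII with placeholders; return the text and placeholder map."""
    mapping: dict[str, str] = {}

    def make_replacer(kind: str):
        def replace(match: re.Match) -> str:
            value = match.group(0)
            # Reuse the same placeholder if this value was already seen.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            placeholder = f"[{kind}_{len(mapping) + 1}]"
            mapping[placeholder] = value
            return placeholder
        return replace

    text = EMAIL_RE.sub(make_replacer("EMAIL"), text)
    text = PHONE_RE.sub(make_replacer("PHONE"), text)
    return text, mapping


def fake_cloud_polish(text: str) -> str:
    """Stub for the cloud call; it only ever receives placeholdered text."""
    return "Hi, " + text.strip() + " Best regards."


def rehydrate(text: str, mapping: dict) -> str:
    """Put the original values back in place of the placeholders, on-device."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


draft = "email anna@example.com or call +1 415 555 0100 to confirm"
sanitized, pii_map = sanitize(draft)
polished = fake_cloud_polish(sanitized)
final = rehydrate(polished, pii_map)
```

The placeholder map never leaves the device, which is the property the pipeline depends on. A production version would also need to handle the cloud model dropping or rewording a placeholder.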

If you're shipping in the EU, or in any privacy-sensitive vertical, this may not be optional. It can be what makes the feature legally shippable.

What it actually saves

Per-month spend, assuming each user uses each feature once per day, 30 days:

Users (MAU)   All-cloud/month   Hybrid/month   Saved/month   Saved/year
1K            $63               $15            $48           $576
10K           $631              $151           $480          $5,765
100K          $6,312            $1,508         $4,804        $57,650
1M            $63,117           $15,075        $48,042       $576,500

These numbers come from real per-call costs:

At 100K MAU, you're saving about $57.6K/year. At 1M, over half a million dollars annually. Below 100K, savings are real but not the kind that would change your roadmap.
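The table scales linearly with users, so it can be rebuilt from two per-user figures. In this sketch the per-user monthly costs are derived from the 1M row of the table ($63,117 and $15,075 per million users), not from per-call measurements, and the function name is hypothetical.

```python
# Per-user monthly costs, derived from the 1M-user row of the table.
ALL_CLOUD_USD_PER_USER_MONTH = 63_117 / 1_000_000
HYBRID_USD_PER_USER_MONTH = 15_075 / 1_000_000


def monthly_savings(users: int) -> dict:
    """Scale the per-user costs linearly to a given number of monthly users."""
    all_cloud = users * ALL_CLOUD_USD_PER_USER_MONTH
    hybrid = users * HYBRID_USD_PER_USER_MONTH
    saved = all_cloud - hybrid
    return {
        "all_cloud": all_cloud,
        "hybrid": hybrid,
        "saved_month": saved,
        "saved_year": saved * 12,
        "saved_share": saved / all_cloud,
    }


for users in (1_000, 10_000, 100_000, 1_000_000):
    row = monthly_savings(users)
    print(
        f"{users:>9,}  all-cloud ${row['all_cloud']:>9,.0f}  "
        f"hybrid ${row['hybrid']:>9,.0f}  "
        f"saved/yr ${row['saved_year']:>10,.0f}  "
        f"({row['saved_share']:.0%})"
    )
```

The saved share is the same at every tier (about 76%), because both paths are priced per call. Only the absolute dollar amount changes with scale.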

Limitations (read this part)

A few things this analysis doesn't pretend to solve:

Latency hurts on local. Voice cleanup takes 16 seconds on-device vs 5 in the cloud. Receipt parsing takes 50 seconds vs 4. For background processing this is fine. For interactive flows where the user is waiting, it's not always acceptable. On-device makes sense when the user can do something else while it runs.

The image quality gap is real. Receipt OCR on Gemma E4B isn't reliable enough for production. If your feature requires accurate visual understanding, this part stays in the cloud for now.

Memory pressure on real devices. A Pixel 8 has 8GB of RAM and is the baseline I tested. On older or lower-tier devices, loading a 4B-parameter model for receipt parsing may not even be possible. Device tier strategy is a real engineering decision, not just an implementation detail.

"Free" local inference isn't quite free. It's free in API costs. It costs you in download size (~1.5GB per model), battery, thermal pressure, and a longer first-run experience. These tradeoffs are real but don't show up in this calculation.
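Taken together, the limitations above amount to a routing rule. The sketch below is one possible way to encode it, not a production policy: the field names are hypothetical, and the 8GB threshold is only an assumption based on the Pixel 8 baseline, not a measured minimum.

```python
from dataclasses import dataclass


@dataclass
class Request:
    needs_vision: bool     # e.g. receipt parsing
    user_is_waiting: bool  # interactive flow vs background processing


def choose_path(request: Request, device_ram_gb: float,
                min_local_ram_gb: float = 8.0) -> str:
    """Return "local" or "cloud" for one request on one device.

    min_local_ram_gb is an assumed threshold based on the tested Pixel 8.
    """
    if request.needs_vision:
        return "cloud"  # on-device vision quality is not there yet
    if request.user_is_waiting:
        return "cloud"  # local latency is several times higher
    if device_ram_gb < min_local_ram_gb:
        return "cloud"  # the model may not load on lower-tier devices
    return "local"


# Background voice cleanup on a Pixel 8-class device stays on-device.
print(choose_path(Request(needs_vision=False, user_is_waiting=False), 8.0))
```

Battery and thermal state are left out here for brevity, but they would be natural extra inputs to the same function.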

What this means for your team

If you're running cloud-only AI on Android with growing usage, three questions are worth asking:

  1. Which of your features are text- or audio-only and not latency-critical? Those are the easiest wins. Voice cleanup, basic summarization, intent extraction: all candidates for on-device today.
  2. Are you shipping in privacy-regulated markets? If yes, the email-style hybrid pipeline isn't a "nice to have" — it might be a compliance requirement.
  3. What's your monthly Gemini/OpenAI bill projected to be at 10x current scale? If that number makes anyone uncomfortable, this is a conversation worth having now, not after.

Want help with this?

If your team is hitting the cost wall on cloud AI, or you need on-device AI working in production on Android — that's what I do. AI cost optimization audits and hybrid architecture design for mobile teams.
