The problem
Most Android teams shipping AI features today are running everything through cloud APIs. Gemini, OpenAI, whatever. It works. The user gets the feature.
But here's the thing: the bill scales with your usage, not your revenue. Once you're past 100K monthly active users on a feature that calls AI on every interaction, the API line item stops looking like infrastructure and starts looking like a problem.
So I ran a real test. Three production scenarios. Same Pixel 8 device. Same prompts in both paths. Compared fully on-device (Gemma E2B / E4B) against fully cloud (Gemini 2.5 Flash). Token-counted everything. Calculated exact spend.
Here's what I found.
The setup
Three features that show up in a lot of mobile apps:
- Voice memo cleanup. User dictates, app cleans up the text — strips fillers, fixes grammar, preserves intent. 27-second audio sample.
- Receipt parsing. User snaps a restaurant receipt, app extracts vendor, date, line items, total into structured JSON.
- Email drafting with PII. User writes a rough email with personal details, app polishes it — but PII never reaches the cloud.
Each test used the same prompt for the local and cloud paths. Cloud token counts come from usageMetadata in the API response. Local counts are a character-based estimate (~4 chars/token), since LiteRT-LM doesn't expose per-call counts on Android.
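For reference, the character-based estimate is about as simple as it sounds. This is a minimal sketch of the ~4 chars/token heuristic, not the actual measurement code; the function name and the round-up behavior are my assumptions.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic from the text above, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length (~4 chars/token)."""
    if not text:
        return 0
    # Round up so a short non-empty string never counts as zero tokens.
    return math.ceil(len(text) / CHARS_PER_TOKEN)


prompt = "Clean up this transcript: remove fillers, fix grammar, keep the meaning."
print(estimate_tokens(prompt))
```

Treat the result as a ballpark. Real tokenizers vary by language and content, so local counts are only loosely comparable to the cloud's usageMetadata figures.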
Results
Voice memo cleanup
Both produced essentially the same cleaned text. Local kept a couple of "Okay. So" fillers that the cloud removed, which tighter prompting on my side should fix. Otherwise, comparable quality.
- Local: ~16.4 sec, free
- Cloud: 5.5 sec, $0.00113 per call
The cloud cost is dominated by audio tokens. Audio input bills at $1.00 per million tokens on Gemini 2.5 Flash, over 3x the text input rate. Gemini counts 32 tokens per second of audio, so a 27-second clip comes to 864 audio tokens, or about $0.00086 at $1/M. The text prompt and the response make up the rest.
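The audio arithmetic can be checked in a few lines. This sketch assumes Gemini's rate of 32 tokens per second of audio and the $1.00/M audio price quoted above; the function name is mine, and the text share is simply whatever is left of the measured $0.00113.

```python
AUDIO_TOKENS_PER_SECOND = 32     # Gemini's audio tokenization rate
AUDIO_PRICE_PER_M_TOKENS = 1.00  # USD per million audio input tokens (from the text)
MEASURED_TOTAL_PER_CALL = 0.00113  # USD, measured cost of one voice cleanup call


def audio_input_cost(seconds: float) -> tuple[int, float]:
    """Return (audio tokens, USD cost) for an audio clip of the given length."""
    tokens = round(seconds * AUDIO_TOKENS_PER_SECOND)
    cost = tokens * AUDIO_PRICE_PER_M_TOKENS / 1_000_000
    return tokens, cost


tokens, audio_cost = audio_input_cost(27)
# Whatever is not audio is the text prompt plus the generated response.
text_share = MEASURED_TOTAL_PER_CALL - audio_cost
print(tokens, f"{audio_cost:.6f}", f"{text_share:.6f}")
# prints: 864 0.000864 0.000266
```

So roughly three quarters of the per-call cost is the audio itself, which is why shortening the prompt barely moves the bill.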
Receipt parsing
This is where local still struggles. Gemma E4B got the vendor name wrong ("CROWNE HOTEL" instead of "CROWNE PLAZA"), got the date wrong, slightly misread a line item. Cloud got it right.
- Local: 50.3 sec, free — but with OCR errors
- Cloud: 4.4 sec, $0.00047 per call, accurate
If your business depends on accurate receipt data — local isn't ready for this case yet. 11x slower with worse quality is a hard sell. Probably 6–12 months until on-device vision catches up.
Email drafting with PII
Pure cost analysis here is a trap. The hybrid pipeline doesn't save you money on this case — the cloud polish step costs about the same whether the email was sanitized first or not.
What it gives you is something different: GDPR compliance by design. Local sanitize replaces PII with placeholders. Cloud polishes the placeholdered version. Local rehydrates the original values back in. The cloud model never sees a real name, email, phone number, or card.
If you're shipping in the EU, or in any privacy-sensitive vertical, this isn't optional. It's what makes the feature legally shippable.
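The sanitize, polish, rehydrate flow can be sketched like this. One loud caveat: in my pipeline the sanitize step is done by the local model, while this sketch swaps in two illustrative regexes (email and phone only) to stay self-contained, so it would miss names, card numbers, and plenty else. `polish_in_cloud` is a hypothetical stand-in for the cloud call.

```python
import re

# Illustrative patterns only. Real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def sanitize(text):
    """Replace PII with placeholders; return (sanitized text, placeholder map)."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def rehydrate(text, mapping):
    """Put the original values back in place of the placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


def polish_in_cloud(text):
    # Hypothetical stand-in for the cloud model; it only ever sees placeholders.
    return text.replace("pls", "please")


draft = "pls call me at +1 415 555 0134 or write to jane.doe@example.com"
safe, mapping = sanitize(draft)       # runs on-device
polished = polish_in_cloud(safe)      # the only step that leaves the device
final = rehydrate(polished, mapping)  # runs on-device
print(safe)
print(final)
```

The placeholder map never leaves the device, which is the whole point. The guarantee is only as good as the sanitizer, though: anything it misses goes to the cloud in the clear.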
What it actually saves
Per-month spend, assuming each user uses each feature once per day, 30 days:
| Users | All-cloud | Hybrid | Saved/month | Saved/year |
|---|---|---|---|---|
| 1K | $63 | $15 | $48 | $576 |
| 10K | $631 | $151 | $480 | $5,765 |
| 100K | $6,312 | $1,508 | $4,804 | $57,650 |
| 1M | $63,117 | $15,075 | $48,042 | $576,500 |
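These numbers come from the measured per-call costs. Hybrid here means voice and receipt both run on-device, and only the email polish step hits the cloud: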
- Voice: $0.00113 (cloud) → $0 (on-device)
- Receipt: $0.00047 (cloud) → $0 (on-device)
- Email polish: $0.00050 — same in both modes (privacy win, not cost win)
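If you want to plug in your own numbers, the table boils down to this. A sketch using the rounded per-call costs listed above, so it lands within about 1% of the table (which was computed from unrounded values); the function and constant names are mine.

```python
# Rounded per-call cloud costs in USD, taken from the list above.
CLOUD_COST_PER_CALL = {"voice": 0.00113, "receipt": 0.00047, "email_polish": 0.00050}
ON_DEVICE_IN_HYBRID = {"voice", "receipt"}  # features the hybrid setup runs locally
DAYS_PER_MONTH = 30


def monthly_cost(users, on_device=frozenset()):
    """Monthly cloud spend if every user calls every feature once per day."""
    per_user_per_day = sum(
        cost for feature, cost in CLOUD_COST_PER_CALL.items()
        if feature not in on_device
    )
    return users * DAYS_PER_MONTH * per_user_per_day


for users in (1_000, 10_000, 100_000, 1_000_000):
    all_cloud = monthly_cost(users)
    hybrid = monthly_cost(users, ON_DEVICE_IN_HYBRID)
    saved = all_cloud - hybrid
    print(f"{users:>9,}  ${all_cloud:>9,.0f}  ${hybrid:>9,.0f}  "
          f"${saved:>9,.0f}/mo  ${saved * 12:>10,.0f}/yr")
```

Passing `on_device={"voice"}` instead models the more conservative setup where receipt parsing stays in the cloud.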
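At 100K MAU, you're saving ~$58K/year. At 1M, over half a million dollars annually. Roughly 70% of that comes from voice cleanup. The rest assumes receipt parsing moves on-device, which the results above say isn't production-ready yet. Below 100K, savings are real but not the kind that would change your roadmap.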
Limitations (read this part)
A few things this analysis doesn't pretend to solve:
Latency hurts on local. Voice cleanup takes about 16 seconds on-device vs 5.5 in the cloud. Receipt parsing takes about 50 vs 4.4. For background processing this is fine. For interactive flows where the user is waiting, it's not always acceptable. On-device makes sense when the user can do something else while it runs.
Image quality gap is real. Receipt OCR on Gemma E4B isn't reliable enough for production. If your feature requires accurate visual understanding, this part stays in the cloud for now.
Memory pressure on real devices. A Pixel 8 has 8GB RAM and is the baseline I tested. On older or lower-tier devices, loading a 4B parameter model for receipt parsing may not even be possible. Device tier strategy is a real engineering decision, not just an implementation detail.
Local model "free" isn't quite free. It's free in API costs. It costs in download size (~1.5GB per model), battery, thermal pressure, and a longer first-run experience. These tradeoffs are real but don't show up in this calculation.
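The first three limitations fold naturally into a routing rule. This is a sketch of the idea rather than production logic: the 8GB floor is just the Pixel 8 baseline I tested on, and the names are hypothetical. Battery, thermals, and download size aren't modeled at all.

```python
from dataclasses import dataclass

MIN_RAM_GB_FOR_ON_DEVICE = 8  # assumption: the Pixel 8 baseline, not a vendor spec


@dataclass(frozen=True)
class Request:
    needs_vision: bool     # e.g. receipt parsing
    user_is_waiting: bool  # interactive flow vs background processing


def choose_backend(request: Request, device_ram_gb: float) -> str:
    """Pick "cloud" or "on_device" from the limitations listed above."""
    if request.needs_vision:
        return "cloud"  # on-device OCR wasn't reliable enough in my test
    if request.user_is_waiting:
        return "cloud"  # 16 s vs 5.5 s matters when someone is watching a spinner
    if device_ram_gb < MIN_RAM_GB_FOR_ON_DEVICE:
        return "cloud"  # the model may not load on lower-tier devices
    return "on_device"


voice_memo_in_background = Request(needs_vision=False, user_is_waiting=False)
print(choose_backend(voice_memo_in_background, device_ram_gb=8))  # prints: on_device
```

Keeping the rule in one function means the thresholds can be tuned per device tier, or flipped remotely, without touching feature code.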
What this means for your team
If you're running cloud-only AI on Android with growing usage, three questions are worth asking:
- Which of your features are text- or audio-only and not latency-critical? Those are the easiest wins. Voice cleanup, basic summarization, intent extraction: all candidates for on-device today.
- Are you shipping in privacy-regulated markets? If yes, the email-style hybrid pipeline isn't a "nice to have" — it might be a compliance requirement.
- What's your monthly Gemini/OpenAI bill projected to be at 10x current scale? If that number makes anyone uncomfortable, this is a conversation worth having now, not after.