Gemma 4 on iPhone: $0/query, Pay in Heat


TL;DR

Google AI Edge Gallery landed on iOS on April 3, 2026 [1]: fully on-device, $0/query, free app. The catch: you’re running E2B (2.3B effective parameters) or E4B (4.5B effective parameters) [1], not the 26B or 31B variants that are actually competitive on hard tasks.

The cost angle, plainly

Here’s the number Token Miser readers care about: $0. Every query runs locally on the device, no API call, no per-token bill. Google AI Edge Gallery is a free download, the models pull from Hugging Face inside the app, and inference never touches a server. For summarization, basic Q&A, voice transcription, and single-image lookups that fit inside a context window, that workload just got free.

The hardware floor is real. E2B needs an iPhone 15 Pro (6 GB RAM); E4B needs an iPhone 16 Pro (8 GB RAM) [1]. The Q4-quantized weights download at setup: 1.3 GB for E2B, 2.5 GB for E4B [1]. And while the base models spec out at 128K context [2], the LiteRT runtime format has a documented 32K cap on-device [1]. Don’t plan your long-document pipeline around the headline number.
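
To make the 32K-versus-128K point concrete, here is a rough pre-flight check you could run before sending a document to the phone. It is a sketch under assumptions, not anything from the app: the 4-characters-per-token ratio is a generic English heuristic rather than Gemma's actual tokenizer, "32K" is read as 32,768 tokens, and the function names and the 1,024-token output reserve are made up for illustration.

```python
RUNTIME_CONTEXT_TOKENS = 32 * 1024   # on-device cap from the article; "32K" read as 32,768 (assumption)
BASE_CONTEXT_TOKENS = 128 * 1024     # base-model spec; not what the phone runtime gives you
CHARS_PER_TOKEN = 4                  # ASSUMPTION: rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate via ceiling division; not Gemma's real tokenizer."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_on_device(text: str, reserve_for_output: int = 1024) -> bool:
    """True if the prompt plus a reserved output budget fits the runtime cap."""
    return estimate_tokens(text) + reserve_for_output <= RUNTIME_CONTEXT_TOKENS


if __name__ == "__main__":
    short_doc = "x" * 100_000   # about 25,000 estimated tokens
    long_doc = "x" * 200_000    # about 50,000 estimated tokens: under 128K, over 32K
    print("short doc fits:", fits_on_device(short_doc))
    print("long doc fits:", fits_on_device(long_doc))
```

The second document is the trap: it would fit the headline 128K spec comfortably and still overflow the on-device runtime.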

Thermal: community testers on Hacker News clocked ~30 tokens/sec decode on iPhone 16 Pro, with the phone running hot under sustained load [3]. Fine for Q&A and short summaries. A continuous agentic loop is a different conversation; the phone has opinions about being a server.
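
For a feel of what ~30 tokens/sec means in wall-clock terms, a back-of-envelope calculation helps. The sketch below takes the community-reported (and unverified) decode rate at face value, ignores prompt prefill time, and assumes no thermal throttling, so treat the results as best-case numbers. The output lengths are illustrative picks, not measurements.

```python
DECODE_TOKENS_PER_SEC = 30.0  # community-reported figure from the article; unverified


def decode_seconds(output_tokens: int, tokens_per_sec: float = DECODE_TOKENS_PER_SEC) -> float:
    """Best-case time to generate output_tokens, ignoring prefill and throttling."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    # Illustrative output sizes: a short answer, a summary, a long agentic step.
    for n in (150, 600, 3000):
        print(f"{n} output tokens: about {decode_seconds(n):.0f} s")
```

A short answer lands in a few seconds; a multi-thousand-token generation keeps the chip pinned for over a minute and a half, which is where the heat comes from.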

What E2B and E4B can and can’t do

Gemma 4 shipped in four variants on April 2, 2026 (Apache 2.0) [2]. The app supports the two smallest: E2B (2.3B effective parameters, ~1.3 GB weights) and E4B (4.5B effective parameters, ~2.5 GB weights) [1]. The 26B MoE and 31B Dense are not available in the mobile app [1].

For context on the gap: Gemma 4’s 31B scores 89.2% on AIME 2026 [2] and ranks #3 among open models on Arena AI [2]. E4B’s published AIME 2026 score is 42.5% [2], a 46.7-point drop. That’s what free costs you in reasoning headroom. These models work and they’re not toys, but they’re not going to outthink a frontier cloud model on hard reasoning or coding tasks. The gap between open-source edge models and proprietary cloud incumbents has been shrinking steadily over the past couple of years. It hasn’t closed.

Features are solid for the tier: multimodal image analysis, on-device voice transcription, thinking-mode chat, and Mobile Actions for natural-language device control, all offline [1]. Built-in skills include Wikipedia lookups, maps, and QR generation; they run sandboxed, and secrets are handled through native dialogs [1].

What this costs you

If you’re paying per token for summarization, voice transcription, image Q&A, or simple device actions, and your workload fits inside a 32K runtime context window [1], Google AI Edge Gallery runs that for $0/query on an iPhone 15 Pro or better. Your phone will get warm. Complex reasoning and coding stay cloud problems for now.
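
If you want to put a dollar figure on the workload you would be moving, the arithmetic is simple. The sketch below is only that: the per-million-token prices and the query volume are hypothetical placeholders, not quotes from any provider, so plug in your own bill.

```python
def monthly_api_cost(
    queries: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Monthly per-token spend for a workload at the given (hypothetical) prices."""
    input_cost = queries * input_tokens_per_query / 1_000_000 * usd_per_million_input
    output_cost = queries * output_tokens_per_query / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


if __name__ == "__main__":
    # HYPOTHETICAL workload and prices, for illustration only.
    cloud = monthly_api_cost(
        queries=10_000,
        input_tokens_per_query=2_000,
        output_tokens_per_query=300,
        usd_per_million_input=0.10,
        usd_per_million_output=0.40,
    )
    on_device = 0.0  # $0/query per the article; hardware, battery, and heat not priced in
    print(f"cloud: ${cloud:.2f}/month, on-device: ${on_device:.2f}/month")
```

For small summarization workloads on a cheap cloud tier the absolute savings can be modest, which is part of why this reads as a cost floor to benchmark against rather than a migration plan.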

Worth a benchmark session as a cost floor. Not a production swap.

Fact Check
  1. Google Developers Blog: “Bring state-of-the-art agentic skills to the edge with Gemma 4”: iOS launch date (April 3, 2026), E2B/E4B parameter counts, hardware floor (iPhone 15 Pro / iPhone 16 Pro), Q4-quantized weight sizes (1.3 GB / 2.5 GB), LiteRT 32K on-device context cap, available variants (E2B and E4B only in app), features list (multimodal, voice, thinking mode, Mobile Actions, built-in skills)
  2. Google Gemma 4 model card on Hugging Face: [unverified] base model 128K context window spec, Apache 2.0 license, April 2, 2026 release date, AIME 2026 scores (31B: 89.2%, E4B: 42.5%), Arena AI #3 open-model ranking. These figures appear in the Google Developers Blog post and/or Gemma technical report, but the primary source URL was not included in the original post’s source block.
  3. Hacker News discussion: [unverified] ~30 tokens/sec decode speed on iPhone 16 Pro under sustained load, reported by community testers. No direct URL was captured in the original source block.