Google Gemma 4: Four Open Models, Apache 2.0, and Benchmark Scores That Rewrite the Leaderboard (2026)
Google DeepMind dropped Gemma 4 on April 2, 2026. The release includes four model sizes and a switch from a custom license to Apache 2.0. Multimodal support spans all variants, and benchmark scores run so far ahead of Gemma 3 that the two feel like different product lines.
The 31B dense model debuted at #3 on the LMArena open-model text leaderboard, behind GLM-5 and Kimi 2.5. The 26B mixture-of-experts (MoE) variant landed at #6. Both models outperform competitors with 20x their parameter count.
For developers running local inference, building on-device apps, or fine-tuning open models for production with tools like LlamaIndex, Gemma 4 changes the math on what's possible at each hardware tier.
What Ships in Gemma 4
Google released four models spanning edge devices to data center GPUs:
- Gemma 4 E2B: 2.3 billion effective parameters (5.1B total with embeddings); text, image, audio, and video input; a 128K context window. Google built it for phones, Raspberry Pi boards, and IoT hardware.
- Gemma 4 E4B: 4.5 billion effective parameters (8B total with embeddings); the same 128K context and full multimodal support. It targets mid-range mobile devices and laptops.
- Gemma 4 26B-A4B: a mixture-of-experts architecture with 128 small experts, activating 8 per token plus one shared expert. Total parameters: 25.2 billion. Active parameters per inference: 3.8 billion. Context window: 256K tokens. It processes text, image, and video input.
- Gemma 4 31B: the dense flagship. All 30.7 billion parameters fire on every forward pass. Same 256K context window. Same text, image, and video support. Google designed it for maximum quality and as a fine-tuning foundation.

All four sizes ship with both base and instruction-tuned checkpoints on Hugging Face.
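As a quick illustration, the lineup can be encoded as a lookup table for choosing a variant by GPU memory budget. The "min_vram_gb" figures below are rough illustrative estimates (roughly Q8-quantized weights), not official requirements:

```python
# Spec sheet for the four Gemma 4 variants described above.
# min_vram_gb values are rough illustrative estimates, not official figures.
GEMMA4_VARIANTS = {
    "E2B":     {"total_params_b": 5.1,  "active_params_b": 2.3,  "context": 128_000, "min_vram_gb": 3},
    "E4B":     {"total_params_b": 8.0,  "active_params_b": 4.5,  "context": 128_000, "min_vram_gb": 6},
    "26B-A4B": {"total_params_b": 25.2, "active_params_b": 3.8,  "context": 256_000, "min_vram_gb": 13},
    "31B":     {"total_params_b": 30.7, "active_params_b": 30.7, "context": 256_000, "min_vram_gb": 31},
}

def pick_variant(vram_gb: float) -> str:
    """Return the largest variant whose rough footprint fits the given VRAM."""
    fitting = [(v["min_vram_gb"], name) for name, v in GEMMA4_VARIANTS.items()
               if v["min_vram_gb"] <= vram_gb]
    if not fitting:
        raise ValueError("Not enough VRAM for any variant")
    return max(fitting)[1]
```

On a 16 GB consumer card this selector lands on the 26B MoE variant, which matches the sweet spot the rest of this article points at.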
The Apache 2.0 Switch
Previous Gemma releases used a custom Google license that restricted certain commercial uses and imposed usage-based thresholds. Developers complained. Enterprise legal teams pushed back.
Gemma 4 ships under Apache 2.0. Zero usage caps, zero custom terms, and no separate agreement needed for commercial deployment. You can modify and distribute the models, sell products built on top of them, with the same freedoms as any other Apache-licensed project.
This makes Gemma 4 more permissive than Llama 4 (which uses Meta's custom open license with a 700M monthly active user threshold) and matches the most permissive of its competitors. For startups and enterprises evaluating open models, the licensing question now favors Google.
Benchmark Scores: The Full Picture
The numbers tell a dramatic story. Comparing Gemma 3 27B to Gemma 4 31B on identical benchmarks (as of April 2, 2026):
- Reasoning and knowledge: MMLU Pro jumped from 67.6% to 85.2%. GPQA Diamond, a graduate-level reasoning test, went from 42.4% to 84.3%. BigBench Extra Hard rose from 19.3% to 74.4%.
- Mathematics: AIME 2026 scores moved from 20.8% to 89.2%. Competition-level math went from borderline failure to near-expert performance in one generation.
- Coding: LiveCodeBench v6 climbed from 29.1% to 80.0%. Codeforces ELO leaped from 110 to 2,150, equivalent to an expert competitive programmer rating.
- Vision: MATH-Vision rose from 46.0% to 85.6%. MMMU Pro went from 49.7% to 76.9%. The model now reads charts and diagrams with high accuracy, including handwritten equations.
- Long context: MRCR v2 at 128K average improved from 13.5% to 66.4%. Gemma 3 accepted long inputs but struggled to retrieve information from them. Gemma 4 does retrieval well at its full 256K window.

The 26B MoE variant hits 97% of the dense model's scores across these benchmarks while activating a fraction of the parameters. Its LMArena text score reaches 1,441 compared to the 31B's 1,452.
The MoE Architecture: Why 3.8B Active Parameters Matter
Google took a different approach from Meta's Llama 4 Scout, which uses 16 large experts. Gemma 4's 26B-A4B packs 128 small experts and routes each token through 8 of them plus one shared always-on expert.
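The routing step is easy to sketch. This is a generic top-k softmax router in plain Python, a conceptual illustration of the technique rather than Gemma's actual implementation:

```python
import math
import random

N_EXPERTS, TOP_K = 128, 8

def route_token(router_logits, k=TOP_K):
    """Select the top-k experts for one token and softmax-normalize their gates."""
    topk = sorted(range(len(router_logits)), key=lambda i: router_logits[i])[-k:]
    m = max(router_logits[i] for i in topk)
    exp = [math.exp(router_logits[i] - m) for i in topk]   # numerically stable softmax
    z = sum(exp)
    gates = [e / z for e in exp]
    return topk, gates

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(N_EXPERTS)]    # one router score per expert
experts, gates = route_token(logits)
# Final output = shared_expert(x) + sum of gate * expert(x) over the 8 selected
# experts; the shared expert is always on, so 9 experts fire per token total.
```

The point of routing through 8 of 128 small experts rather than a handful of big ones is finer-grained specialization at the same active-parameter budget.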
Inference costs drop because fewer parameters activate per token. Latency falls. You need less VRAM. The 26B model requires about 13 GB VRAM at Q8 quantization, putting it within reach of a 16 GB consumer GPU like the NVIDIA RTX 4060 Ti.
For comparison, running the full 31B dense model unquantized in bfloat16 requires a single 80 GB NVIDIA H100. Quantized versions fit on consumer hardware, but the MoE variant delivers 97% of that quality at a fraction of the memory footprint.
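The memory arithmetic behind those figures is simple to check. This sketch counts resident weights only; KV cache and activation overhead come on top:

```python
def weight_vram_gb(n_params: float, bytes_per_weight: float) -> float:
    """Weights-only memory footprint in GB; KV cache and activations add more."""
    return n_params * bytes_per_weight / 1e9

# 31B dense in bfloat16 (2 bytes/weight): ~61 GB, hence the single 80 GB H100.
dense_bf16 = weight_vram_gb(30.7e9, 2)

# The same model at ~1 byte/weight (Q8): ~31 GB, which is why quantized
# versions start fitting on consumer hardware.
dense_q8 = weight_vram_gb(30.7e9, 1)

# Note: an MoE model's full weight set stays resident even though only a few
# experts fire per token; lower quoted figures for the 26B variant likely
# assume further tricks such as offloading inactive experts to system RAM.
```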
If you care about tokens-per-second on constrained hardware, the 26B-A4B is the model to watch.
On-Device Models: E2B and E4B
Google positioned the E2B and E4B models for mobile and edge deployment. The Pixel team collaborated with Qualcomm and MediaTek to optimize both models for smartphones, Raspberry Pi, and NVIDIA Jetson Orin Nano.
Both edge models support audio input, a capability the larger 26B and 31B variants lack. This makes them candidates for speech recognition and voice-driven interfaces on mobile devices.
The models use Per-Layer Embeddings (PLE), a technique from Gemma 3n that gives each decoder layer its own conditioning signal instead of relying on a single embedding vector. This adds per-layer specialization at low parameter cost.
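Conceptually, PLE replaces one shared input embedding with a small per-layer lookup. The sketch below illustrates the idea with made-up sizes; it is not Gemma's actual implementation:

```python
import random

N_LAYERS, VOCAB, D_EMBED = 4, 100, 8
random.seed(0)

# One small embedding table per decoder layer (the "per-layer" in PLE).
ple_tables = [
    [[random.gauss(0, 0.02) for _ in range(D_EMBED)] for _ in range(VOCAB)]
    for _ in range(N_LAYERS)
]

def layer_signal(token_id: int, layer: int):
    """Each layer reads its own embedding for the token, instead of every
    layer sharing a single input embedding vector."""
    return ple_tables[layer][token_id]
```

Because each table is small, the extra parameters are cheap relative to the specialization each layer gains.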
A shared KV cache across the final layers reduces memory and compute during inference. Combined with the low active parameter counts, Google claims "near-zero latency" on supported devices.
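The savings from sharing a KV cache can be illustrated with simple counting; the layer counts and sharing scheme here are invented for illustration, not Gemma's actual configuration:

```python
N_LAYERS, SHARED_TAIL = 12, 4   # pretend the last 4 layers reuse one KV cache

def kv_cache_slots(n_layers: int, shared_tail: int) -> int:
    """Distinct KV caches kept in memory: one per early layer, one for the tail."""
    return (n_layers - shared_tail) + 1

slots = kv_cache_slots(N_LAYERS, SHARED_TAIL)   # 9 caches instead of 12
savings = 1 - slots / N_LAYERS                  # 25% less KV memory in this sketch
```

At long context lengths the KV cache dominates memory on small devices, which is why this kind of sharing matters for the edge models in particular.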
Android developers can test these models now through the AICore Developer Preview, which provides forward-compatibility with Gemini Nano 4.
How Gemma 4 Compares to Llama 4 and Qwen 3.5
The open model field has three dominant families right now: Meta's Llama 4, Alibaba's Qwen 3.5, and Google's Gemma 4.
- Context length: Llama 4 Scout offers 10 million tokens of context, dwarfing both Gemma 4 (256K) and Qwen 3.5 (also 256K). If your workload demands massive context windows, Llama 4 Scout remains unmatched.
- Efficiency per parameter: Gemma 4's 26B-A4B delivers frontier-class quality at 3.8B active parameters. Llama 4 Scout activates 17B of its 109B total. Gemma 4 wins the efficiency-per-active-parameter contest.
- Math and reasoning: Gemma 4 31B scores 89.2% on AIME 2026. In March benchmarks, Qwen 3.5-27B scored 48.7% on AIME 2025. The scoring conditions differ, but the gap is wide.
- Licensing: Gemma 4 uses Apache 2.0. Llama 4 uses Meta's custom license with commercial thresholds. Qwen 3.5 uses Apache 2.0. Google and Alibaba tie on licensing freedom; Meta trails.
- Multimodal: All three families handle images. Gemma 4's edge models add native audio. Llama 4 supports image and text. Qwen 3.5 handles image and audio through separate model variants, with video support in select checkpoints.
- VRAM requirements: Gemma 4 26B-A4B at Q8 needs about 13 GB. Qwen 3.5 35B at Q8 needs more VRAM at 100K context. Llama 4 Scout at full precision requires far more memory. Gemma 4's MoE architecture gives it an edge on consumer hardware.

Who Should Use Gemma 4
- Local-first developers who want offline code assistance: The 31B dense model or 26B MoE running on a workstation GPU replaces cloud API calls for code generation. Pair it with Cursor ($20/mo Pro) or a local IDE plugin for a self-hosted coding workflow. Codeforces ELO of 2,150 puts it in expert programmer territory.
- Mobile app builders targeting Android: The E2B and E4B models with audio support, combined with the AICore Developer Preview, create a path to on-device AI features without cloud dependencies.
- Startups shipping products on open models: Apache 2.0 means no licensing surprises at scale. No monthly active user caps. No custom legal review needed.
- Researchers fine-tuning for specialized tasks: The 31B dense model provides a strong foundation. Google highlights examples like INSAIT's Bulgarian language model and Yale's cancer therapy research built on previous Gemma generations.
- Teams running inference on consumer GPUs: The 26B MoE variant fits on 16 GB VRAM cards while delivering 97% of the flagship's quality. For budget-constrained deployments, this is the sweet spot.

What to Watch For
Gemma 4's benchmarks look strong on paper. Independent community testing over the next few weeks will reveal how those scores translate to real-world tasks. Early Reddit discussions on r/LocalLLaMA note that Gemma 4 uses more system RAM at high context lengths than Qwen 3.5 35B under similar quantization settings.
The thinking mode, where Gemma 4 generates thousands of reasoning tokens before answering, drives many of the math and coding improvements. Tasks that benefit from chain-of-thought reasoning will see the biggest gains. Quick-answer tasks may not show the same leap.
Gemma has also passed 400 million downloads across all generations, with more than 100,000 community-built variants in the "Gemmaverse." That ecosystem momentum matters for long-term adoption and tooling support.
Getting Started: Tools and Costs
All Gemma 4 models are available now on Hugging Face. Here are the best ways to run them, with current pricing as of April 2026:
- Ollama (free) offers the fastest path to testing. Install it, pull the Gemma 4 model matching your hardware, and start prompting. The 26B MoE variant gives you the best quality-per-VRAM ratio on consumer GPUs. Ollama is open source and costs nothing to run.
- Google AI Studio (free tier) lets you test Gemma 4 through a browser interface with no setup. Google AI Studio is free in all available regions. For production API calls, you pay per token through the Gemini Developer API.
- LM Studio (free) provides a desktop GUI for running Gemma 4 with one-click model downloads. Free for personal and commercial use.
- HuggingChat (free) gives you browser-based access to Gemma 4 without any local hardware requirements.
- Cursor ($20/mo Pro) pairs well with a local Gemma 4 instance for AI-assisted coding. With Gemma 4's Codeforces ELO of 2,150, the combination gives you expert-level code generation without cloud API costs beyond the Cursor subscription.

For a broader look at local inference options, see our guide to the best local LLM tools.
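Once Ollama is installed and a model is pulled, you can script against its local REST API (POST /api/generate on port 11434). The model tag "gemma4:26b-a4b" below is a guess at the naming scheme; run "ollama list" to see the actual tags:

```python
import json

def generate_request(model: str, prompt: str) -> str:
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

body = generate_request("gemma4:26b-a4b", "Summarize KV caching in one sentence.")
# POST this body to http://localhost:11434/api/generate with a running Ollama
# server; the response JSON carries the completion in its "response" field.
```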
For mobile development, the AICore Developer Preview provides the integration path for Android apps.
Gemma 4 represents Google's strongest argument that open models can compete with proprietary ones. The benchmarks support that claim. Whether the community's real-world testing confirms it will determine whether Gemma 4 becomes the default choice for local AI development in 2026.
Pricing and benchmark data as of April 2026. Gemma 4 models are free to download and use under the Apache 2.0 license. Pricing and features for third-party tools change frequently. Check official sites for current details.
Tools Featured in This Article
Ready to get started? Here are the tools we recommend:
Gemini
Google's flagship AI assistant combining real-time web search, multimodal understanding, and native Google Workspace integration for productivity-focused users.
Google AI Studio
Google's free platform for experimenting with Gemini AI models, building prompts, prototyping multimodal applications, and generating API keys for production deployment.
Ollama
Run enterprise-grade language models locally with zero per-token costs, complete data privacy, and sub-100ms response times for AI agent development and deployment.
HuggingChat
Open-source AI chatbot with automatic model routing that intelligently selects from 15+ cutting-edge models including Llama 3.2, Command R+, and Mistral. Provides free access for roughly 20 daily messages, with a $9/month PRO tier offering unlimited use and priority inference.
Llama Stack
Meta's standardized API and toolchain for building AI agents with Llama models, providing inference, safety, memory, and tool use in a unified stack.
Cursor
AI-first code editor with autonomous coding capabilities. Understands your codebase and writes code collaboratively with you.