Every major open-weights model release in the past three years has been a variation on the same architecture: autoregressive transformer, predict one token at a time, left to right. DiffusionGemma-26B, released by Google DeepMind on June 10 under Apache 2.0, is the first serious break from that pattern to come out of a major lab.

The headline number is 1,000+ tokens per second on an H100. That’s 4-5x faster than comparable autoregressive models on the same hardware. But the more interesting question is why — because the answer changes how you think about what this model is good for.

How Autoregressive Generation Creates a Latency Floor

To understand what’s new, you need to understand what’s old. Autoregressive transformers have a structural constraint: each token depends on all previous tokens. You cannot generate token N until tokens 1 through N-1 are complete. The computation is inherently sequential.

This creates a hard latency floor. No matter how much hardware you throw at the problem, the per-token generation time has a minimum. For a 1,000-token output at 250 tokens/sec, you’re waiting 4 seconds minimum — even with unlimited compute.
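The arithmetic behind that floor is simple enough to write down. This is a minimal sketch using the figures from the paragraph above; the function name is made up for illustration.

```python
def sequential_latency_floor(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit output_tokens when they are produced one at a time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second

# Figures from the article: a 1,000-token output at 250 tokens/sec.
floor_seconds = sequential_latency_floor(1_000, 250)
print(f"{floor_seconds:.1f} s")  # 4.0 s
```

Adding GPUs raises how many requests you can serve at once, but it does not shorten this number for any single request.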

The other structural limitation: autoregressive models can’t revise. Once a token is generated, it’s committed. If the model starts down the wrong path at token 3, it can try to recover in tokens 4-100, but it can’t go back. This is one reason you sometimes see frontier models produce responses that start coherently and then degrade: the early tokens painted the model into a corner.

Uniform State Diffusion: Generating a Canvas, Not a Stream

DiffusionGemma works differently. Here’s the mechanism:

  1. Instead of generating token 1, then token 2, then token 3… the model opens a “canvas” — a fixed-length sequence (up to 256 tokens) filled with random placeholder tokens.

  2. It then runs a series of “denoising passes” over the entire canvas simultaneously. In each pass, the model looks at all positions at once (bidirectional attention) and locks in the tokens it’s most confident about.

  3. After each pass, some tokens are fixed (“frozen”) and the uncertain ones continue to be refined in subsequent passes.

  4. This continues until all tokens are frozen — producing the full output.

The key insight: bidirectional attention means the model can see the entire canvas when refining any individual token. Until a position is frozen, the model can revise its guess there based on what it has decided elsewhere, including later in the sequence. Autoregressive models fundamentally cannot do this.

The speed comes from the parallelism: in each pass, the model locks in roughly 15-20 confident tokens simultaneously rather than one at a time.
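To make the loop concrete, here is a toy simulation of confidence-based freezing. It is not DiffusionGemma's implementation: the "model" is a random number generator, open positions are marked with None instead of random placeholder tokens, and the function name and the 16-tokens-per-pass figure are assumptions chosen to sit inside the 15-20 range quoted above.

```python
import random

MASK = None  # marks a canvas position that has not been frozen yet

def toy_denoise(canvas_len: int, tokens_per_pass: int, seed: int = 0):
    """Simulate confidence-based freezing; returns (canvas, passes_used)."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    passes = 0
    while any(tok is MASK for tok in canvas):
        passes += 1
        # Stand-in for the model: a (token, confidence) proposal for every open position.
        proposals = {
            i: (f"tok{rng.randrange(1000)}", rng.random())
            for i, tok in enumerate(canvas)
            if tok is MASK
        }
        # Freeze the most confident proposals; the rest stay open for the next pass.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
    return canvas, passes

canvas, passes = toy_denoise(canvas_len=256, tokens_per_pass=16)
print(passes)  # 16 passes, since 256 / 16 = 16
```

A one-token-at-a-time decoder would need 256 sequential steps for the same canvas. That ratio is where the throughput gain comes from, though a real denoising pass over the whole canvas costs more than a single autoregressive step.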

The Technical Specs

DiffusionGemma-26B-A4B is a 26B-parameter Mixture of Experts model that activates only 3.8B parameters per forward pass. The “A4B” in the name stands for “Active 4 Billion”, a rounded figure for the inference-time compute you’re actually using.

Key numbers:

  • H100: 1,000+ tokens/sec
  • RTX 5090: 700+ tokens/sec
  • VRAM: 18GB for the quantized version (fits on a single consumer GPU)
  • Context window: 256K tokens
  • Input modalities: text, image, video (interleaved)
  • License: Apache 2.0

The 18GB VRAM footprint is significant. With the quantized weights, the model fits on a single high-end consumer card (the RTX 4090 has 24GB, the RTX 5090 has 32GB) without sharding across GPUs. For organizations that want to run inference locally for data privacy reasons, the hardware requirements are reasonable.
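A rough back-of-the-envelope check shows why quantization matters here. This sketch counts weight storage only, and assumes all 26B parameters stay resident in VRAM (no expert offloading). The article does not say which bit width the 18GB figure corresponds to, so the widths below are illustrative.

```python
def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

# Assumption: every expert is resident, not just the ~3.8B active parameters.
TOTAL_PARAMS = 26e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
# 16-bit: 52 GB
#  8-bit: 26 GB
#  4-bit: 13 GB
```

At 16 bits the weights alone exceed any single consumer card. Somewhere around 4 to 5 bits per parameter, plus working memory for activations and cache, is consistent with an 18GB footprint.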

What It’s Actually Good For

The quality caveat is real: overall quality is lower than standard Gemma 4 autoregressive models. Google’s own benchmarks show this. Don’t reach for DiffusionGemma for tasks that require sustained multi-step reasoning or complex instruction following where you’re competing with Gemini 2.5 Pro.

Where the architecture shines:

Code infilling and in-context editing. This is the killer use case. When you’re filling in code within an existing context — “complete this function given the surrounding code” — bidirectional attention is genuinely better than autoregressive. The model can see what comes before and after the gap, which is exactly the information you want for infilling. Copilot-style middle-of-file edits benefit more from this than start-of-file generation.

Constrained generation. When the output format is partially determined — you know the structure of what you need and just want the model to fill in specific fields — diffusion models handle this more naturally. The canvas starts with your known tokens and the model fills in the rest.
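Here is a small sketch of what "the canvas starts with your known tokens" means in practice. It is a hypothetical illustration in plain Python, not the DiffusionGemma API: the helper names are invented, tokens are whole strings instead of real tokenizer IDs, and the model's proposals are hard-coded.

```python
from typing import Optional

def build_canvas(template: list[Optional[str]]) -> tuple[list[Optional[str]], list[int]]:
    """Return a copy of the canvas plus the indices the model still has to fill."""
    canvas = list(template)
    open_positions = [i for i, tok in enumerate(canvas) if tok is None]
    return canvas, open_positions

def fill(canvas: list[Optional[str]], open_positions: list[int],
         proposals: dict[int, str]) -> list[Optional[str]]:
    """Write proposals into open positions only; known tokens are never overwritten."""
    filled = list(canvas)
    for i in open_positions:
        if i in proposals:
            filled[i] = proposals[i]
    return filled

# JSON skeleton: the structure is fixed, the values (None) are open.
template = ["{", '"name"', ":", None, ",", '"age"', ":", None, "}"]
canvas, open_positions = build_canvas(template)

# The proposal for index 0 is ignored because that position is already known.
result = fill(canvas, open_positions, {3: '"Ada"', 7: "36", 0: "["})
print(" ".join(result))  # { "name" : "Ada" , "age" : 36 }
```

The same shape covers code infilling: the prefix and suffix are the known tokens and the gap is the open span.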

High-throughput batch processing. If you’re running a pipeline that processes thousands of similar inputs, the throughput advantage compounds. At 1,000 tokens/sec vs. 250 tokens/sec, you’re processing 4x the volume with the same hardware budget.
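To put that in hardware-hours, here is a quick estimate for a batch job. The workload size is hypothetical, and the calculation ignores prompt processing and batching effects; only the two throughput figures come from the article.

```python
def batch_hours(num_inputs: int, tokens_per_output: int, tokens_per_second: float) -> float:
    """Hours of single-GPU generation time for a batch job."""
    return num_inputs * tokens_per_output / tokens_per_second / 3600

# Hypothetical workload: 100,000 inputs, 200 generated tokens each.
JOB = dict(num_inputs=100_000, tokens_per_output=200)

ar_hours = batch_hours(**JOB, tokens_per_second=250)
diffusion_hours = batch_hours(**JOB, tokens_per_second=1_000)
print(f"autoregressive: {ar_hours:.1f} h, diffusion: {diffusion_hours:.1f} h")
# autoregressive: 22.2 h, diffusion: 5.6 h
```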

Latency-sensitive interactive applications. At 1,000 tokens/sec, a 200-token response (a typical API reply) arrives in 0.2 seconds, not counting prompt processing or network overhead. That’s short enough that it barely registers as waiting. For tools where the AI response is part of a tight feedback loop, this changes the UX calculation significantly.

Running It

The model is on Hugging Face at google/diffusiongemma-26B-A4B-it. The inference API is different from autoregressive models — you don’t sample token by token, you run the denoising loop:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/diffusiongemma-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

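# Generation uses num_diffusion_steps instead of max_new_tokens
prompt = "Complete this function:\ndef binary_search(arr, target):"
# Move the input tensors onto the model's device before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    num_diffusion_steps=20,   # more steps = better quality, slower
    max_length=256,
    do_sample=False
)
# Drop padding and other special tokens from the decoded text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))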

The num_diffusion_steps parameter is the main quality/speed trade-off lever. More steps produce higher quality output but reduce throughput. The default of 20 steps is a reasonable starting point; for code generation you might push to 30-40 for better quality.
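If you want to pick a step count empirically, a small timing harness helps. This is a minimal sketch: sweep_steps and fake_generate are hypothetical names, and the stand-in generator only sleeps so the example runs without the model or a GPU.

```python
import time
from typing import Callable, Iterable

def sweep_steps(generate: Callable[[int], str], step_counts: Iterable[int]) -> list[dict]:
    """Time generate(steps) for each step count and collect the outputs."""
    results = []
    for steps in step_counts:
        start = time.perf_counter()
        text = generate(steps)
        elapsed = time.perf_counter() - start
        results.append({"steps": steps, "seconds": elapsed, "text": text})
    return results

# Stand-in generator so the sketch runs on its own; it just sleeps per step.
def fake_generate(steps: int) -> str:
    time.sleep(0.001 * steps)
    return f"output after {steps} steps"

for row in sweep_steps(fake_generate, [10, 20, 30, 40]):
    print(f'{row["steps"]:>3} steps  {row["seconds"] * 1000:6.1f} ms')
```

To use it for real, replace fake_generate with a function that wraps the model.generate call shown above and passes its argument as num_diffusion_steps. Then read the outputs side by side to see where extra steps stop paying off.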

The Architectural Implication for Developer Tooling

The split between autoregressive and diffusion inference is going to matter for how you architect AI-powered developer tools in the next 12 months.

For features that need maximum reasoning quality — explaining complex code, architecture reviews, multi-file refactors — stick with autoregressive models (Gemma 4, Fable 5, etc.). The quality advantage is real.

For features where speed and throughput dominate — autocomplete, inline code infilling, real-time suggestions as you type — DiffusionGemma’s latency profile makes it competitive with much larger models that you’d have to host at greater cost.

The interesting middle ground: code linting and style suggestions, where you’re processing the current file continuously as the developer types. At 0.2 seconds per response, you can give feedback in real time without the “thinking…” lag that makes current AI linting feel awkward.
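One way to encode that split in a tool is a plain routing table. This is a sketch, not a recommendation for any particular framework: the feature names and the Backend enum are illustrative, and the grouping simply mirrors the three paragraphs above.

```python
from enum import Enum

class Backend(Enum):
    AUTOREGRESSIVE = "autoregressive"
    DIFFUSION = "diffusion"

# Illustrative feature names; the grouping follows the article's split.
ROUTES = {
    "explain_code": Backend.AUTOREGRESSIVE,
    "architecture_review": Backend.AUTOREGRESSIVE,
    "multi_file_refactor": Backend.AUTOREGRESSIVE,
    "autocomplete": Backend.DIFFUSION,
    "inline_infill": Backend.DIFFUSION,
    "live_lint": Backend.DIFFUSION,
}

def pick_backend(feature: str) -> Backend:
    """Route a feature to a backend; unknown features fall back to the higher-quality model."""
    return ROUTES.get(feature, Backend.AUTOREGRESSIVE)

print(pick_backend("inline_infill").value)   # diffusion
print(pick_backend("something_new").value)   # autoregressive
```

Defaulting unknown features to the autoregressive backend trades speed for quality, which is the safer failure mode while the quality gap exists.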

Google’s decision to open-source this under Apache 2.0 is also notable. The open-weights text diffusion space has been thin compared to image diffusion — there are solid open-weights image diffusion models (Stable Diffusion, FLUX) but nothing comparable for text. DiffusionGemma-26B changes that, and the Apache 2.0 license means commercial use is unrestricted.

Watch the quality gap. The architecture is sound, the throughput advantage is real, and the open-weights availability is good for the ecosystem. If Google closes the quality gap in the next iteration, this becomes genuinely competitive with autoregressive models across a wider range of tasks.
