The Transformer Isn't Dead — Its Monopoly Is

And the real race is no longer about who has more GPUs.

Every major AI model you use today — ChatGPT, Claude, Gemini, Grok — runs on the same fundamental architecture: the Transformer. Introduced in 2017 by Google’s “Attention Is All You Need” paper, it has dominated AI for nearly a decade with a single, elegant idea: let every word in a sequence attend to every other word simultaneously.

It worked. Brilliantly.

But that brilliance comes with a bill. Attention scales quadratically with sequence length: double the context window and the attention compute quadruples. At 100,000 tokens, the cost becomes a genuine constraint. At a million tokens, it becomes prohibitive. And training a frontier model from scratch, the kind of brute-force scaling that brought us GPT-4 and Claude Opus, now costs hundreds of millions of dollars per run.
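A back-of-the-envelope sketch makes the scaling concrete. It counts only the query-key pairs that full self-attention scores for one head in one layer and ignores everything else in the model, so read the output as relative cost rather than real FLOPs. The function name is illustrative.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention for one head in one layer."""
    return seq_len * seq_len


base = attention_pairs(1_000)
for n in (1_000, 2_000, 100_000, 1_000_000):
    pairs = attention_pairs(n)
    # Cost relative to a 1,000-token context.
    print(f"{n:>9,} tokens -> {pairs:>17,} pairs ({pairs // base:,}x)")
```

Going from 1,000 to 2,000 tokens multiplies the pair count by four. Going from 1,000 to a million multiplies it by a million.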

The AI industry’s current answer to this problem is simple: throw more hardware at it. More H100s. Bigger clusters. Larger datacenters. More power.

But what if the answer isn’t more GPUs? What if it’s a better architecture?

The challengers

Over the past two years, a family of alternative architectures has quietly matured from academic curiosity to production-ready reality. They share a common thesis: the Transformer’s quadratic attention mechanism isn’t just expensive — it’s unnecessary for many of the things we need AI to do.

Mamba, introduced in late 2023 by Albert Gu and Tri Dao, replaced attention entirely with selective state spaces — a mechanism borrowed from control theory that processes sequences in linear time. A Mamba-3B model outperformed Transformers of the same size and matched ones twice as large. By March 2026, Mamba reached version 3, published at ICLR 2026, with an inference-first design that achieves comparable perplexity to Mamba-2 using half the state size.
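For intuition, here is a toy one-dimensional version of a selective state-space recurrence. It is a sketch under loud assumptions, not the Mamba kernel: the state is a single number, the weights are made-up scalars, and the discretization is simplified. It only shows the shape of the computation: one pass over the sequence, a fixed-size state, and parameters that depend on the current input.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective state-space recurrence (all weights are assumptions).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b_t * x_t
    y_t = c_t * h_t
    delta_t, b_t and c_t are computed from the current input x_t.
    """
    h = 0.0
    ys = []
    for x in xs:  # single pass: time grows linearly with len(xs), state stays O(1)
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps the step size positive
        b = w_b * x  # input-dependent write strength
        c = w_c * x  # input-dependent read strength
        h = math.exp(delta * a) * h + delta * b * x
        ys.append(c * h)
    return ys


print(selective_scan([0.5, -1.0, 2.0, 0.0]))
```

Because each step touches only the previous state, doubling the sequence doubles the work instead of quadrupling it.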

Titans, from Google Research (Ali Behrouz et al., December 2024), introduced a neural long-term memory module that learns to memorize based on surprise — events that violate expectations are stored more persistently. Presented at NeurIPS 2025, Titans scaled to context windows beyond 2 million tokens with better accuracy than Transformers on needle-in-a-haystack tasks.
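A toy sketch of the surprise idea, loosely following the shape of the update described in the Titans paper. The assumptions are significant: the memory here is a single scalar weight rather than a neural module, and the learning rate, momentum, and forgetting gate are fixed constants rather than learned, data-dependent values. Surprise is measured as the gradient of the memory’s prediction error.

```python
def update_memory(m, s, key, value, lr=0.1, momentum=0.9, forget=0.01):
    """One inference-time update of a toy scalar memory that predicts value ~ m * key.

    All hyperparameters are illustrative constants, not values from the paper.
    """
    error = m * key - value
    surprise = 2.0 * error * key        # gradient of the squared error with respect to m
    s = momentum * s - lr * surprise    # carry past surprise forward
    m = (1.0 - forget) * m + s          # decay old content, then write
    return m, s, abs(surprise)


m, s = 0.0, 0.0
stream = [(1.0, 2.0)] * 5 + [(1.0, -3.0)]  # five consistent pairs, then an unexpected one
for key, value in stream:
    m, s, size = update_memory(m, s, key, value)
    print(f"value={value:+.1f}  surprise={size:.2f}  memory={m:.3f}")
```

Inputs the memory already predicts well produce a small gradient and barely change it. An input that contradicts expectations produces a large one and is written more strongly.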

Infini-Attention (Google, April 2024) extended the Transformer toward effectively infinite context by compressing past information into a persistent memory bank, suggesting that the Transformer might mutate rather than disappear.
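The core of a compressive memory can be sketched in a few lines. This is a simplified illustration in the spirit of the linear-attention style memory that Infini-Attention builds on, not the paper’s full mechanism: it leaves out the local attention path and the learned gate that mixes the two, and the class and method names are hypothetical.

```python
import math


def elu_plus_one(x):
    """Positive feature map: ELU(x) + 1."""
    return x + 1.0 if x > 0 else math.exp(x)


class CompressiveMemory:
    """Fixed-size associative memory: storage does not grow with sequence length."""

    def __init__(self, key_dim, value_dim):
        self.M = [[0.0] * value_dim for _ in range(key_dim)]  # key_dim x value_dim matrix
        self.z = [0.0] * key_dim                               # normalization vector

    def write(self, key, value):
        phi = [elu_plus_one(k) for k in key]
        for i, p in enumerate(phi):
            self.z[i] += p
            for j, v in enumerate(value):
                self.M[i][j] += p * v

    def read(self, query):
        phi = [elu_plus_one(q) for q in query]
        norm = sum(p * z for p, z in zip(phi, self.z))
        width = len(self.M[0])
        if norm == 0.0:  # nothing written yet
            return [0.0] * width
        return [sum(p * row[j] for p, row in zip(phi, self.M)) / norm for j in range(width)]


memory = CompressiveMemory(key_dim=2, value_dim=2)
memory.write([1.0, -1.0], [3.0, 4.0])
print(memory.read([1.0, -1.0]))
```

However many key-value pairs are written, the memory stays a 2 x 2 matrix plus a 2-element vector. The price is that older entries blur together as more are added.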

Multi-token prediction (Meta, 2024) attacked a different assumption: instead of predicting one token at a time, predict several simultaneously. DeepSeek-V3 adopted this technique, and the efficiency gains were substantial.
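The change in training objective is easy to see at the data level. The sketch below only builds training targets: for each position it lists the next few tokens the model would be asked to predict at once. The function name and the word-level tokens are illustrative, and how a model actually produces those predictions (extra output heads, extra modules) varies by paper and is not shown.

```python
def multi_token_targets(tokens, n_future):
    """Pair each context with the next n_future tokens it should predict."""
    examples = []
    for i in range(len(tokens) - n_future):
        context = tokens[: i + 1]
        targets = tokens[i + 1 : i + 1 + n_future]
        examples.append((context, targets))
    return examples


sentence = ["the", "cat", "sat", "on", "the", "mat"]
for context, targets in multi_token_targets(sentence, n_future=2):
    print(context, "->", targets)
```

With n_future=1 this reduces to ordinary next-token prediction.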

None of these is a silver bullet. Each addresses a different limitation. But together, they paint a clear picture: the era of the Transformer as the only viable architecture is over.

From papers to production

This is no longer theory. Hybrid models — architectures that combine Transformer attention with SSM layers — are already shipping in production:

NVIDIA’s Nemotron-H replaced 92% of attention layers with Mamba-2 blocks, delivering up to 3x throughput compared to pure Transformers like LLaMA-3.1 and Qwen-2.5, while matching or exceeding accuracy on standard benchmarks. The models are open-sourced.
AI21’s Jamba 1.5 scaled a hybrid Transformer-Mamba-MoE architecture to 398 billion total parameters with 94 billion active, supporting 256K-token context windows. The ratio: one Transformer layer for every seven Mamba layers.
Microsoft’s Phi-4-mini-flash-reasoning introduced SambaY, a decoder-hybrid-decoder architecture combining Mamba, sliding window attention, and a novel Gated Memory Unit. With 3.8 billion parameters, it achieved performance comparable to models twice its size, at up to 10x higher throughput.
IBM’s Bamba-9B reduced model size from 18GB to 9GB via quantization while maintaining performance comparable to LLaMA-3.1 8B.
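Two of the figures above reduce to simple arithmetic. The sketch below lays out a layer stack with one attention layer for every seven Mamba layers, then estimates weight storage at two precisions. The assumptions: where the attention layer sits inside each group of eight is arbitrary here, the 32-layer depth is just an example, and the 18GB and 9GB figures are assumed to correspond to roughly 9 billion weights at 16 and 8 bits each.

```python
def hybrid_schedule(num_layers, attention_every=8):
    """Layer types with one attention layer per `attention_every` layers (placement is an assumption)."""
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(num_layers)]


def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


layers = hybrid_schedule(32)
print(layers.count("attention"), "attention layers,", layers.count("mamba"), "Mamba layers")
print(weight_gigabytes(9e9, 16), "GB at 16 bits,", weight_gigabytes(9e9, 8), "GB at 8 bits")
```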

The pattern is consistent: hybrid architectures match Transformer accuracy at a fraction of the inference cost, especially for long sequences. The consensus forming in the research community is not “Transformers vs. SSMs” — it’s “how much attention do you actually need, and where?”

The deeper shift: learning that doesn’t stop

Faster inference and cheaper training are important. But the most radical line of research points somewhere else entirely.

Today’s language models have a fundamental limitation: they are static after training. Their knowledge freezes at a cutoff date. Their weights don’t update when you use them. Every conversation starts from the same frozen snapshot of the world. This is like having a colleague with a perfect memory of everything they read in school — but who hasn’t learned a single thing since graduation.

Nested Learning, published by Google Research at NeurIPS 2025 (Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni — the same team behind Titans), proposes something heretical: the distinction between a model’s architecture and its training algorithm is an illusion. They are the same thing — nested levels of optimization, each with its own flow of information and update frequency.

The practical consequence: you can design models with a continuum memory system — modules that update at different rates. Some update with every token (fast, working memory). Others update slowly, consolidating knowledge over thousands of steps (long-term memory). The model doesn’t just process data — it continuously learns from it, at multiple timescales simultaneously.
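The multi-rate idea can be illustrated with a toy. This is not the Hope architecture or the paper’s continuum memory system. It is a minimal sketch in which each level is a single number that accumulates a signal and commits an update only every `period` steps, with hypothetical names, periods, and learning rates.

```python
class MemoryLevel:
    """One memory module that commits an update every `period` steps."""

    def __init__(self, name, period, lr):
        self.name, self.period, self.lr = name, period, lr
        self.value = 0.0    # stand-in for the module's parameters
        self.pending = 0.0  # signal accumulated since the last commit
        self.updates = 0

    def observe(self, step, signal):
        self.pending += signal
        if step % self.period == 0:
            # Move toward the average signal seen in this window.
            self.value += self.lr * (self.pending / self.period - self.value)
            self.pending = 0.0
            self.updates += 1


levels = [
    MemoryLevel("fast", period=1, lr=0.5),
    MemoryLevel("medium", period=16, lr=0.2),
    MemoryLevel("slow", period=256, lr=0.05),
]
for step in range(1, 1025):
    for level in levels:
        level.observe(step, signal=1.0)

for level in levels:
    print(f"{level.name:>6}: {level.updates:4d} updates, value={level.value:.3f}")
```

Over the same 1,024 steps the fast level updates on every token, while the slow level commits only four times and moves cautiously each time.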

Their proof-of-concept architecture, Hope, is a self-modifying recurrent model that can literally learn its own update rules during inference. It outperformed Transformers and Titans on language modeling, common-sense reasoning, and long-context tasks.

Earlier, in May 2025, the same team released ATLAS, which introduced DeepTransformers, a strict generalization of the original Transformer architecture with optimized memory. ATLAS achieved over 80% accuracy at 10 million tokens of context on the BABILong benchmark. Ten million tokens. That’s roughly 15,000 pages of text.
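The page estimate follows from two rule-of-thumb ratios, both assumptions rather than figures from the paper: about 0.75 English words per token and about 500 words per page.

```python
def tokens_to_pages(tokens, words_per_token=0.75, words_per_page=500):
    """Rough page count; both ratios are rule-of-thumb assumptions."""
    return tokens * words_per_token / words_per_page


print(round(tokens_to_pages(10_000_000)))
```

Different tokenizers and page layouts shift the result, so treat it as an order-of-magnitude figure.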

This is one team, inside Google Research, publishing three papers in one year, each building on the last. This isn’t scattered academic output. This is a research program.

The concept that changes the economics

Here’s where this gets interesting for anyone who cares about the business of AI — which should be everyone.

Training a frontier model today is an event. A massive, concentrated burn of compute that costs hundreds of millions of dollars and takes months. If you want a better model, you largely start over. Each improvement requires another enormous upfront investment.

Nested Learning suggests a different model: distribute the learning over time. Instead of burning all your compute upfront in a single training run, make the model improve continuously as it operates. Every inference cycle becomes a small learning step. The cost of improvement shifts from a massive capital expenditure to a distributed operational flow.

This doesn’t eliminate training. You still need a strong base model. But it fundamentally changes the economics of keeping that model current, relevant, and improving.

And this concept — learning through iteration, not just execution — is already showing up in products, even if the underlying implementation isn’t yet using these architectures directly.

Where it’s already happening

Google’s Jitro — the internal codename for Jules V2, their next-generation coding agent — was revealed days ago. Its positioning: “Manually prompting your agents is so… 2025.” Instead of defining specific tasks, developers set high-level goals — improve test coverage, reduce latency, increase accessibility compliance — and the agent autonomously identifies what needs to change in the codebase and iterates toward the target. It has its own persistent workspace. It maintains goals, insights, and update histories. It doesn’t execute once and forget — it operates in a loop, building on previous iterations.

Zhipu’s GLM-5.1, released two days ago, takes this further. The model can autonomously handle a single coding task for up to eight hours — planning, executing, testing, and optimizing in a continuous loop. Their technical paper describes novel asynchronous Agent RL algorithms specifically designed for learning from long-horizon interactions. The model was trained entirely on Huawei Ascend chips — zero NVIDIA hardware — and its API costs roughly 5-8x less than comparable Western frontier models.

Neither of these products is confirmed to use Nested Learning or Hope architectures under the hood. They may well be using Transformers with sophisticated scaffolding — agent frameworks, tool chains, external state databases. But the concept is the same: models that iterate, remember, and self-correct over time, rather than models that respond to a single prompt and forget.
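What “scaffolding” means in practice can be shown with a toy loop. Everything here is hypothetical: the function names, the coverage metric, and the random proposal step stand in for a real agent framework, and nothing reflects how Jitro or GLM-5.1 are actually built. The point is only that goals, history, and the current best result live outside the model and are carried from one iteration to the next.

```python
import random


def run_goal_loop(measure, propose, state, target, max_iters=50, seed=0):
    """Iterate toward a target metric, keeping state and history outside the model."""
    rng = random.Random(seed)
    history = []  # persistent record of what was tried
    best = measure(state)
    for i in range(max_iters):
        if best >= target:
            break
        candidate = propose(state, rng)
        score = measure(candidate)
        kept = score > best
        if kept:  # self-correct: discard changes that do not help
            state, best = candidate, score
        history.append({"iteration": i, "score": score, "kept": kept})
    return state, best, history


# Toy goal: raise "test coverage" (covered lines out of 100) from 40% toward 80%.
final, best, history = run_goal_loop(
    measure=lambda covered: covered / 100,
    propose=lambda covered, rng: min(100, max(0, covered + rng.randint(-3, 5))),
    state=40,
    target=0.8,
)
print(f"coverage {best:.0%} after {len(history)} iterations")
```

The loop improves the result without the underlying model changing at all, which is exactly the limitation the next paragraph turns to.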

The scaffolding approach works. But it’s brittle and expensive. An architecture that does this natively — that learns continuously by design rather than by external engineering — would be fundamentally more efficient. And that’s exactly what Google Research is building.

The ai-2027.com connection

This trajectory aligns precisely with the scenario mapped by ai-2027.com — the detailed AGI roadmap by ex-OpenAI researcher Daniel Kokotajlo and Scott Alexander. Their timeline describes AI agents that progressively automate AI research itself: each generation of agent helps build the next, faster and cheaper.

By mid-2026 in their scenario, AI achieves a 1.5x research multiplier — one week of agent-assisted work produces what previously took 1.5 weeks. By March 2027, “superhuman coders” emerge. By late 2027, the multiplier hits 50x.

For that scenario to materialize, you need exactly what Nested Learning describes: models that don’t just execute instructions but learn from their own iterations. Models where every cycle of work makes the next cycle slightly better. Models that close the loop between action and improvement.

The companies that solve this first — that make every inference cycle count as learning — are the ones that can ride the exponential. The ones still burning hundreds of millions per training run are buying lottery tickets.

Who’s positioned and who’s exposed

Google has both the fundamental research (Behrouz’s team: Titans → ATLAS → Nested Learning → Hope) and the products that need it (Jitro, Gemini). They have patient capital, infrastructure ownership, and no existential pressure to monetize every breakthrough immediately. They can afford to let this research mature.

Chinese labs, particularly Zhipu (GLM-5.1) and DeepSeek, are proving that constraint breeds innovation. Training on domestic hardware at a fraction of Western costs, they’re reaching near-frontier performance at roughly 12-20% of the price, going by GLM-5.1’s 5-8x API discount. GLM-5.1’s 8-hour autonomous coding loops are not a gimmick. They’re a demonstration that continuous operation is viable today.

OpenAI and Anthropic remain focused on scaling the Transformer paradigm. Bigger models, more compute, higher subscription prices. This works as long as brute force stays ahead of efficiency. But as model quality converges — and it is converging — the advantage shifts from who has the most H100s to who has the best architecture.

The real frontier is no longer the best benchmark score. It’s the best learning loop.

What this means for you

If you’re paying $20/month for an AI subscription, here’s what matters: the model you’re using today was frozen months ago. It doesn’t learn from your conversations. It doesn’t improve from its mistakes. Every session starts from zero.

The next generation of AI won’t work that way. Models that learn continuously, that improve through use, that distribute their training cost across time instead of concentrating it in a single massive burn — these are coming. They’ll be faster, cheaper to run, and more capable over time rather than static.

The question is who builds them first, and whether you’ll need to pay $200/month for what should cost $20 — or whether competition from Chinese labs forces the pricing to reflect the actual economics.

We’ll be watching. That’s what Fridays are for.

This is the first edition of The Frontier View’s Friday series — a weekly look at the research and applications shaping AI’s next chapter. Wednesday and Sunday posts continue with our usual editorial analysis.

Sources

Papers referenced:

Mamba (Gu & Dao, 2023): arXiv 2312.00752
Mamba-2 / State Space Duality (Dao & Gu, 2024): arXiv 2405.21060
Mamba-3 (Lahoti et al., 2026): arXiv 2603.15569 — ICLR 2026
Titans (Behrouz et al., 2024): arXiv 2501.00663 — NeurIPS 2025
Nested Learning (Behrouz et al., 2025): arXiv 2512.24695 — NeurIPS 2025
ATLAS (Behrouz et al., 2025): arXiv 2505.23735
Infini-Attention (Google, 2024): arXiv 2404.07143
Multi-token Prediction (Meta, 2024): arXiv 2404.19737
DeepSeek-V3 (2024): arXiv 2412.19437
GLM-5 Technical Report (Zhipu/Tsinghua, 2026): arXiv 2602.15763

Products and announcements:

Jitro / Jules V2: testingcatalog.com, April 6, 2026
GLM-5.1: techbriefly.com, April 8, 2026
Nemotron-H: NVIDIA, open-sourced via Hugging Face
Jamba 1.5: AI21 Labs
Phi-4-mini-flash-reasoning / SambaY: Microsoft, July 2025
ai-2027.com scenario: Daniel Kokotajlo & Scott Alexander