Nested Learning: The Illusion of Deep Learning Architecture (Ali Behrouz)
Ali Behrouz, PhD student at Cornell and student researcher at Google, presented a sweeping unification of deep learning architectures and optimization algorithms under a single principle: both are associative memories, just applied in different contexts.
- Paper: Nested Learning: The Illusion of Deep Learning Architecture
- Also see: Miras framework, Titans architecture
- Blog posts: Nested Learning blog, Miras blog
- Presenter: Ali Behrouz
The reason I was so excited to host this talk is that Ali’s work operationalizes something I’ve believed for quite a while now: Architecture = Objective + Optimization. That is, once you pick an objective (what are you optimizing?) and an optimizer (how do you achieve it?), the architecture falls out. Ali showed this goes beyond just a nice philosophy: it’s actually a recoverable, generative framework for novel architectures and optimizers. Every modern sequence model, from linear attention to Transformers to state-space models, can be derived as a specific choice within this design space. And crucially, large regions of this space remain completely unexplored.
Brain inspiration at the right level of abstraction
We opened with a short Q&A on whether Ali’s work counts as “brain-inspired.” He shared that the work drew polarized reactions from neuroscientists—some saying it matches exactly how they think the brain works, others saying it’s all wrong. Ali’s position is that brain inspiration should operate at the right level of abstraction: identifying the underlying rules and constraints the brain faces, without claiming the implementation details are identical. He also lamented that after ~2018, the field shifted heavily toward efficiency over effectiveness. His work deliberately invests more computation per artificial neuron, trading training efficiency for sample efficiency and richer internal representations.
Architectures as associative memory
The technical core of the talk reframed various modern architectures as solutions to an associative memory problem: given keys and values, learn a mapping between them by optimizing some objective. Ali showed that choosing dot-product similarity + gradient descent recovers linear attention; choosing L2 regression loss + gradient descent recovers the delta learning rule; and solving the problem non-parametrically recovers softmax attention.
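These recoveries are easy to see in code. Below is a minimal NumPy sketch (my own toy, with arbitrary dimensions and learning rates, not the authors' implementation) of the two parametric cases: dot-product similarity plus gradient descent yielding linear attention, and L2 regression plus gradient descent yielding the delta rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 8
K = rng.standard_normal((T, d))   # keys
V = rng.standard_normal((T, d))   # values
Q = rng.standard_normal((T, d))   # queries

# Dot-product objective: maximize <M k, v>. One gradient step per token
# (lr = 1) gives the additive write M += v k^T, and reading out y = M q
# is exactly (unnormalized) linear attention.
M = np.zeros((d, d))
for k, v in zip(K, V):
    M += np.outer(v, k)
y_linear = M @ Q[-1]

# L2 objective: minimize ||M k - v||^2. One gradient step gives the
# delta rule M += lr * (v - M k) k^T, which overwrites stale values
# instead of only accumulating new ones.
lr = 0.5
M_delta = np.zeros((d, d))
for k, v in zip(K, V):
    M_delta += lr * np.outer(v - M_delta @ k, k)

# Sanity check: the memory readout equals the familiar kernel form
# sum_t (q . k_t) v_t of linear attention.
y_kernel = sum((Q[-1] @ k) * v for k, v in zip(K, V))
```

The readout `M @ q` matching the kernel-sum form is the whole point: the "architecture" is just the closed form of one optimization step per token on the associative-memory objective.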
The Miras framework formalizes this into four design choices: (i) memory architecture (vector, matrix, MLP, or deeper), (ii) attentional bias objective, (iii) retention gate, and (iv) learning algorithm. Most existing architectures cluster in a tiny corner of this space—vector or matrix memory, dot-product or L2 objective, gradient descent. The unexplored territory is vast.
On the optimizer side, Ali showed that backpropagation itself is a form of associative memory that maps input data to prediction errors, and that even the Adam optimizer emerges as the optimal solution to a specific objective that balances current gradients against a running global summary of past gradients. This means the same framework that designs architectures also designs optimizers. And since the optimizer’s context (gradients) is generated by the architecture, the two form an interconnected system that should be understood jointly.
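To make the optimizer-side claim concrete, here is a standard Adam step written to emphasize the memory view (this is vanilla Adam, not the paper's derivation; the framing in the comments is mine): the first moment `m` is an exponentially decayed memory of the gradient stream, with `b1` playing the role of a retention gate, and `v` is a second memory summarizing squared gradients.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad      # "write" grad into memory m
    v = b2 * v + (1 - b2) * grad**2   # memory of squared gradients
    m_hat = m / (1 - b1**t)           # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 from x = 3; the gradient is 2x.
x, m, v = 3.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

Seen this way, the retention gates `b1` and `b2` are the optimizer's analog of an architecture's forgetting mechanism, which is exactly the symmetry the framework exploits.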
Memorization vs. compression: where does learning actually happen?
During the meeting, Tyler Bonnen asked: all the formulations shown so far seem great for memorization—compressing tokens or contexts into memory—but where does abstraction come in? Where’s the compression that produces higher-level knowledge? Ali agreed entirely, noting that his team deliberately calls this process “test-time memorization” in the Miras paper. The answer, he argued, lies in knowledge transfer between levels of a nested system. When a single loop compresses its context, that’s memorization. But when that compressed knowledge is transferred to a higher-level loop—via meta-learning, backpropagation, or direct parameterization—it becomes learning, because the information is being abstracted across contexts. Different forms of knowledge transfer (initialization, direct connection, context generation) give rise to different known concepts: meta-learning, RNNs, Transformers, and hypernetworks, respectively. Tyler closed the session by telling Ali: “This is beautiful work. It’s changing the field. People are paying attention. Everybody’s excited. It’s really great to hear it from your perspective.”
The twin paradox and the spectrum of memory
Ali motivated his continuum memory system with a striking analogy to the twin paradox in special relativity. One twin travels near light speed and experiences only minutes; the other stays home and lives through eighty years. The twin who experienced minutes remembers every detail of their shared ice cream; the one who aged has long forgotten. In the same way, a high-frequency memory module that updates every token experiences rapid “time”—and may forget quickly—while a low-frequency module that updates every hundred thousand tokens barely ages at all, preserving information across vast stretches of context. This motivates the Hope architecture: a stack of MLP blocks updated at different frequencies, ranging from fast (every token) to slow (every hundred thousand tokens). Ali demonstrated that this design enables near-perfect needle-in-a-haystack retrieval at ten million tokens and, more impressively, continual in-context learning of two unseen languages simultaneously. Standard in-context learning collapses on that task, but Hope with three frequency levels nearly recovers single-language performance.
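A toy version of the multi-frequency idea (my own sketch, loosely in the spirit of Hope; the periods, chunk averaging, and linear outer-product memories are illustrative assumptions, not the paper's design):

```python
import numpy as np

# Each level i gathers per-token updates but only commits them to its
# weights every periods[i] tokens, so fast levels track the current
# chunk while slow levels barely "age".
rng = np.random.default_rng(0)
d = 8
periods = [1, 4, 16]                 # update frequency per level
weights = [np.zeros((d, d)) for _ in periods]
pending = [np.zeros((d, d)) for _ in periods]

def process(token_k, token_v, t):
    for i, p in enumerate(periods):
        pending[i] += np.outer(token_v, token_k)   # gather the chunk
        if (t + 1) % p == 0:                       # commit on schedule
            weights[i] += pending[i] / p           # full-strength write
            pending[i][:] = 0.0

for t in range(64):
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    process(k, v, t)
```

The level with period 16 sees only four "moments" over these 64 tokens, which is the discrete analog of the slow-aging twin.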
Why discrete frequencies, not just different learning rates?
Hadi pushed on this design choice: why use discrete update frequencies rather than simply assigning different learning rates to different modules? Ali’s answer was precise: a smaller learning rate still processes every token, just with a gentler update. This means the model never pauses. But with discrete frequencies, the model can fully process one chunk, stop, and then approach the next chunk with full capacity. He gave the example of a long mathematical context where ten critical tokens contain the actual answer. With a small learning rate, those tokens get the same diminished update as everything else. With discrete frequency, the model can hit those tokens with a full-strength learning rate while the low-frequency modules preserve the broader context. Ali confirmed there is empirical evidence for this: in the M3 optimizer, having multiple momentum terms with different frequencies outperforms having multiple momentum terms with different learning rates.
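Ali's math-context example can be illustrated with a deliberately contrived toy (entirely my own construction, and it assumes an oracle gate that fires exactly on the critical chunk): diluting every token with a small learning rate versus hitting only the critical tokens at full strength.

```python
import numpy as np

# A stream of 100 noise tokens plus 10 "critical" tokens that all carry
# a fixed target direction. A uniformly small learning rate dilutes the
# critical writes; a gated full-strength update does not.
rng = np.random.default_rng(1)
d = 16
target = np.ones(d) / np.sqrt(d)
noise = [rng.standard_normal(d) * 0.3 for _ in range(100)]
critical = [target + rng.standard_normal(d) * 0.01 for _ in range(10)]

mem_small_lr = np.zeros(d)
for tok in noise + critical:
    mem_small_lr += 0.01 * tok       # every token, gentle update

mem_gated = np.zeros(d)
for tok in critical:                 # fire only on the key chunk
    mem_gated += 1.0 * tok           # full-strength update

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

The gated memory ends up far better aligned with the target direction than the small-learning-rate memory, which is the qualitative gap Ali described between frequency gating and learning-rate scaling.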
Hadi noted that while the twin paradox analogy is beautiful, it may break in one respect: in relativity, proper time (the time experienced by each observer, which depends on their worldline through spacetime) is continuous. Ali acknowledged the analogy isn’t perfect and added an important caveat: the discreteness in his framework is partly an artifact of discrete tokens. If the input data were continuous, the update frequency could in principle be continuous too. He also noted that individual neurons could eventually have their own update frequencies, pushing toward a truly continuous spectrum.
Delta gradient descent and beyond
One novel implication Ali highlighted is delta gradient descent: replacing the dot-product similarity in the associative-memory formulation of gradient descent with L2 regression loss. This introduces an input-dependent adaptive weight decay that lets momentum drift when the loss landscape demands it. A toy example showed delta momentum finding the global minimum where standard momentum sails right past it.
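My reading of this update, sketched in NumPy (illustrative only; consult the paper for the exact formulation): writing momentum as a matrix memory keyed by the current gradient, the L2 objective turns the purely additive write of standard momentum into a delta write, and the delta term factors into a data-dependent decay.

```python
import numpy as np

# Delta write with target g keyed by g:
#   M <- M + eta * (g - M g) g^T  =  M (I - eta * g g^T) + eta * g g^T,
# i.e. an input-dependent weight decay M (I - eta g g^T) that erases
# stale momentum along directions the current gradient revisits.
# beta is a conventional retention gate on top.
d, eta, beta = 3, 0.1, 0.9

def delta_momentum_step(M, g):
    decay = np.eye(d) - eta * np.outer(g, g)   # adaptive, data-dependent
    return beta * (M @ decay) + eta * np.outer(g, g)

rng = np.random.default_rng(0)
M = np.zeros((d, d))
for _ in range(50):
    g = rng.standard_normal(d)
    M = delta_momentum_step(M, g)
update_direction = M @ g                       # read memory with key g
```

The contrast with standard momentum, whose decay `beta * M` is constant regardless of the incoming gradient, is what lets delta momentum "drift when the loss landscape demands it."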
Hadi asked whether the outer product of gradients in this formulation could be interpreted as approximating second-order curvature information. Ali noted there’s a debate in the optimization community about calling any first-order method a second-order approximation, but agreed the formulation captures more about the loss landscape geometry than standard momentum. Hadi pushed the interpretation further: the delta term doesn’t approximate the Hessian so much as redirect momentum into the orthogonal subspace of the gradient. This can be interpreted as a “solenoidal” motion through parameter space that feels deeply interesting but hard to articulate.
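Hadi's geometric picture is easy to state in code (a sketch of the interpretation itself, not of any published update rule):

```python
import numpy as np

# Remove the component of momentum m lying along the current gradient
# g, leaving motion in the gradient's orthogonal subspace -- the
# "solenoidal" drift Hadi described.
def project_orthogonal(m, g):
    g_unit = g / np.linalg.norm(g)
    return m - (m @ g_unit) * g_unit   # m minus its along-g component

m = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 0.0, 1.0])
m_perp = project_orthogonal(m, g)      # -> [1, 2, 0]
```

Whatever the right second-order story turns out to be, this projection view makes the qualitative behavior testable: the resulting step carries no component along the instantaneous gradient.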
Broader implications
When I first encountered Ali’s work at his NeurIPS poster, I thought to myself: this Nested Learning stuff is “test of time” material. And his presentation just reinforced that view. The reason is simple: this isn’t just another architecture paper. It’s a lens. A conceptual advance, a “way of thinking.” Once you see architectures and optimizers as the same thing—associative memories differing only in their context, objective, and update rule—you can’t unsee it. And the practical consequence is immediate: instead of proposing architectures based on heuristics and intuition, we now have a principled design space where you choose your objective, choose your optimizer, and the architecture writes itself.
The vast majority of this space is unexplored. If the field takes this framework seriously—and given the reception in the room, I think it will—we should expect a wave of systematically motivated architectural innovations rather than the usual pattern of heuristic tinkering followed by post-hoc rationalization.
Watch the full meeting here: