Tiny Nested Learning Example
A hands-on comparison of Transformers vs HOPE
Note
HOPE is not an acronym.
It is the name of a recurrent architecture introduced by Google's Nested Learning research. HOPE models use a Continuum Memory System (CMS) composed of fast, medium, and slow memory tracks.
Each track updates at a different timescale, allowing the model to learn new information while preserving long-term stability. This structure helps HOPE reduce catastrophic forgetting compared to architectures that rely on a single shared state (such as Transformers or standard RNNs).
The name "HOPE" reflects the goal of achieving hopeful, continual learning: retaining older knowledge while integrating new tasks.
This project is a compact, intuitive demonstration of continual learning: how a machine learning model behaves when it learns Task A and then Task B, and whether it forgets what it learned earlier.
It implements a simplified version of ideas from Google's Nested Learning research and compares:
- a tiny Transformer encoder (baseline attention-only learner)
- a tiny HOPE-inspired recurrent model using a continuum memory system (CMS) with fast, medium, and slow update timescales
Both models are trained on two tiny natural-language tasks:
- Task 0: Catch a train
- Task 1: Catch a flight
By observing how much each model remembers Task 0 after learning Task 1, we get a clear, human-readable demonstration of catastrophic forgetting and how multi-timescale memory can mitigate it.
Setup

```bash
cd tiny-nested-learning
docker compose build --no-cache
docker compose up
```
What the Script Does
- Builds two tiny text tasks from short English stories.
- Trains the Transformer on Task 0 → Task 1.
- Trains the HOPE model on the same sequence.
- Prints color-coded retention tables to show forgetting vs. retention.
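In outline, the whole experiment is "train on a task, then measure accuracy on every task", repeated once per task. The sketch below shows that protocol with a stand-in model; the names are hypothetical and are not the script's actual API, only the shape of the loop.

```python
# Outline of the continual-learning protocol (hypothetical names, not the
# script's actual API): train on each task in order, then measure accuracy
# on every task after each training stage.

class DummyModel:
    """Stand-in for either the tiny Transformer or the HOPE model."""
    def __init__(self):
        self.last_trained = None

    def train_on(self, task_name, sentences):
        self.last_trained = task_name          # a real model would fine-tune here

    def accuracy(self, task_name):
        # Pretend the most recently trained task is answered best.
        return 1.0 if task_name == self.last_trained else 0.5

def run_continual(model, tasks):
    """Return history["after_<task>"][task] = accuracy, the raw data behind the retention tables."""
    history = {}
    for task_name, sentences in tasks:
        model.train_on(task_name, sentences)
        history[f"after_{task_name}"] = {name: model.accuracy(name) for name, _ in tasks}
    return history

tasks = [("Task_0", ["catch a train story..."]), ("Task_1", ["catch a flight story..."])]
print(run_continual(DummyModel(), tasks))
```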
The Two Tiny Tasks

Task 0: Catch a Train

```text
i walk to the station with my ticket
i wait on the platform for the blue train
i find my seat and watch trees go by
```

Task 1: Catch a Flight

```text
i take a cab to the busy airport
i wait in a long line at the gate
i find my seat and watch clouds go by
```
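For reference, one plausible way to turn these sentences into next-word-prediction examples is sketched below. The script's actual preprocessing may differ; this assumes whitespace tokenization and a single vocabulary shared by both tasks, and the names are illustrative.

```python
# Minimal sketch: whitespace tokenization + (context -> next word) pairs.
TASK_0 = [
    "i walk to the station with my ticket",
    "i wait on the platform for the blue train",
    "i find my seat and watch trees go by",
]
TASK_1 = [
    "i take a cab to the busy airport",
    "i wait in a long line at the gate",
    "i find my seat and watch clouds go by",
]

# Shared vocabulary so both tasks map into the same embedding table.
vocab = {w: i for i, w in enumerate(sorted({w for s in TASK_0 + TASK_1 for w in s.split()}))}

def make_pairs(sentences):
    """Yield (prefix token ids, next-token id) pairs for next-word prediction."""
    for s in sentences:
        ids = [vocab[w] for w in s.split()]
        for t in range(1, len(ids)):
            yield ids[:t], ids[t]

print(len(vocab), "words in the shared vocabulary")
print(next(make_pairs(TASK_0)))   # e.g. ([id of "i"], id of "walk")
```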
Architecture Overview (with Mermaid Visuals)
Transformer: One Shared Memory System

```mermaid
flowchart LR
    subgraph Transformer
        T0[Token IDs] --> TE[Embedding Layer]
        TE --> MH[Multi-Head Attention]
        MH --> LN1[LayerNorm + Residual]
        LN1 --> FFN[Feed Forward Network]
        FFN --> LN2[LayerNorm + Residual]
        LN2 --> LOGITS[Token Logits]
    end
```
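For readers who want to map the diagram onto code: a comparable baseline fits in a few lines of PyTorch. This is a sketch, not the project's exact implementation, and the hyperparameters (d_model, nhead, vocab_size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Rough PyTorch equivalent of the diagram above; hyperparameters are illustrative."""
    def __init__(self, vocab_size, d_model=32, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, vocab_size)    # token logits

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))       # one shared set of attention + FFN weights
        return self.head(h[:, -1])                    # predict the next token from the last position

model = TinyTransformer(vocab_size=40)
logits = model(torch.randint(0, 40, (2, 7)))
print(logits.shape)                                   # torch.Size([2, 40])
```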
HOPE: Multi-Timescale Memory (Fast / Medium / Slow)

```mermaid
flowchart LR
    subgraph HOPE
        T0[Token IDs] --> EMB[Embedding Layer]
        EMB --> CMS[CMS Cell<br/>Controller + Rate Adapter]
        CMS --> F[Fast Memory<br/>Update ~ 0.6]
        CMS --> M[Medium Memory<br/>Update ~ 0.3]
        CMS --> S[Slow Memory<br/>Update ~ 0.02]
        F & M & S --> AGG[Aggregated State]
        AGG --> LOGITS[Token Logits]
    end
```
HOPE Memory Update: Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    participant X as Input Token<br/>x_t
    participant C as Controller
    participant R as Rate Adapter
    participant F as Fast Memory
    participant M as Medium Memory
    participant S as Slow Memory
    participant A as Aggregator
    participant O as Output Logits
    X->>C: Token embedding
    C->>C: Compute candidate<br/>state h_candidate
    C->>R: Send candidate for<br/>rate computation
    R->>F: α_fast
    R->>M: α_med
    R->>S: α_slow
    Note over F: Update rule:<br/>(1 - α_fast)*old<br/> + α_fast*h_candidate
    Note over M: Update rule:<br/>(1 - α_med)*old<br/> + α_med*h_candidate
    Note over S: Update rule:<br/>(1 - α_slow)*old<br/> + α_slow*h_candidate
    F->>A: Updated fast memory
    M->>A: Updated med memory
    S->>A: Updated slow memory
    A->>A: Combine memories<br/>h_combined
    A->>O: Produce logits<br/>next-word prediction
```
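The core of the diagram is the per-track convex-combination update. Below is a minimal PyTorch sketch of such a CMS cell. It is illustrative rather than the project's exact code: the rates are fixed at the values shown in the flowchart, whereas the Rate Adapter in the diagram could compute them per step.

```python
import torch
import torch.nn as nn

class CMSCell(nn.Module):
    """Illustrative multi-timescale memory cell: one candidate state, three tracks
    that blend it in at different rates (rates mirror the flowchart above)."""
    def __init__(self, dim, rates=(0.6, 0.3, 0.02)):
        super().__init__()
        self.rates = rates
        self.controller = nn.Linear(2 * dim, dim)         # candidate from input + combined state
        self.combine = nn.Linear(len(rates) * dim, dim)   # aggregator

    def forward(self, x, memories):
        # x: (batch, dim) token embedding; memories: list of (batch, dim) tracks.
        combined = torch.stack(memories).mean(0)
        candidate = torch.tanh(self.controller(torch.cat([x, combined], dim=-1)))
        new_memories = [(1 - a) * m + a * candidate        # (1 - α)*old + α*candidate, per track
                        for a, m in zip(self.rates, memories)]
        out = self.combine(torch.cat(new_memories, dim=-1))  # aggregated state, later mapped to logits
        return out, new_memories

cell = CMSCell(dim=16)
mems = [torch.zeros(1, 16) for _ in range(3)]              # fast, medium, slow tracks
out, mems = cell(torch.randn(1, 16), mems)
print(out.shape)                                            # torch.Size([1, 16])
```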
Example Results (Including Real Output)

```text
==================== TRAINING TINY_HOPE ====================
Epochs per task: 3, Batch size: 64
Training tiny_hope
Starting Task_0
Finished Task_0
Evaluating retention after Task_0...
Training tiny_hope
Starting Task_1
Finished Task_1
Evaluating retention after Task_1...
Completed all tasks for tiny_hope
```
Transformer Retention Table

```text
==================== TRANSFORMER RETENTION ====================
Evaluation Task | After Task_0 | After Task_1 | Forgetting
---------------------------------------------------------------------------
Task_0          | 0.975        | 0.800        | -0.175
Task_1          | 0.525        | 1.000        | 0.475
===========================================================================
```
HOPE Retention Table

```text
==================== HOPE RETENTION ====================
Evaluation Task | After Task_0 | After Task_1 | Forgetting
---------------------------------------------------------------------------
Task_0          | 0.575        | 0.700        | 0.125
Task_1          | 0.325        | 0.625        | 0.300
===========================================================================
```
Continual Learning Summary

```text
Transformer forget: -0.175
HOPE forget: 0.125
HOPE retained more memory (less forgetting).
```
Educational Explanation: Understanding the Results
Catastrophic forgetting occurs when a model learns Task 1 and overwrites what it learned in Task 0.
Final accuracy alone is misleading.
The correct metric in continual learning is:
FORGETTING = Final Accuracy - Start Accuracy (both measured on the same task)
A score near zero (or positive) means the model kept what it learned; a large negative score means the earlier task was overwritten. The better continual learner is the one whose score stays closest to zero or above.
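Plugged into the numbers from the retention tables above, the calculation is just a subtraction (plain Python, no dependencies):

```python
# Forgetting on Task_0 = accuracy after Task_1  minus  accuracy right after Task_0
transformer = {"after_task_0": 0.975, "after_task_1": 0.800}
hope        = {"after_task_0": 0.575, "after_task_1": 0.700}

for name, acc in [("Transformer", transformer), ("HOPE", hope)]:
    forgetting = acc["after_task_1"] - acc["after_task_0"]
    print(f"{name}: forgetting on Task_0 = {forgetting:+.3f}")
# Transformer: forgetting on Task_0 = -0.175   (accuracy dropped)
# HOPE: forgetting on Task_0 = +0.125          (accuracy actually improved)
```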
In this experiment:
- The Transformer's Task_0 accuracy fell from 0.975 to 0.800, a drop of 17.5 points.
- HOPE's Task_0 accuracy actually rose from 0.575 to 0.700, a gain of 12.5 points, i.e. no catastrophic forgetting at all.
This happens because HOPE uses multi-timescale memory:
- Fast memory → adapts quickly
- Medium memory → blends
- Slow memory → preserves long-term knowledge
This mirrors Google's Nested Learning idea:
Learning at multiple speeds protects older knowledge while adapting to new tasks.
Key Takeaway
The better continual learner is the one that FORGETS LESS.
HOPE wins this experiment.
Why HOPE Works Better Here
- Slow memory barely changes → protects Task 0
- Fast memory absorbs Task 1 quickly → lower interference
- Medium memory blends both patterns
- Transformer updates one shared weight space, overwriting earlier information
HOPE demonstrates how multi-timescale memory can significantly reduce catastrophic forgetting.
Disclaimer: HOPE Can Also Forget More
To be scientifically honest:
HOPE can forget more than a Transformer if:
- fast memory rate is too high
- slow memory is not slow enough
- tasks are extremely different
- the model is very tiny
- training runs too long
Example bad setting:

```text
fast = 0.95
medium = 0.50
slow = 0.10
```

Produces retention like:

```text
Transformer retains: 0.82
HOPE retains: 0.40
```
This demonstrates:
Multi-timescale memory is powerful only when tuned properly.
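One way to see why tuning matters: under the update rule from the sequence diagram, a track with rate α keeps only (1 - α)^N of its original contents after N updates. The short, dependency-free sketch below compares the default rates with the bad setting above (the step count of 50 is an arbitrary illustration, not a measured value from the script):

```python
# Fraction of a memory track's original contents that survives N updates,
# given the convex-combination rule new = (1 - alpha) * old + alpha * candidate.
good = {"fast": 0.60, "medium": 0.30, "slow": 0.02}   # defaults from the flowchart
bad  = {"fast": 0.95, "medium": 0.50, "slow": 0.10}   # the "bad setting" above

steps = 50   # arbitrary, illustrative number of update steps while learning Task 1
for label, rates in [("good", good), ("bad", bad)]:
    survived = {name: round((1 - alpha) ** steps, 4) for name, alpha in rates.items()}
    print(label, survived)
# good {'fast': 0.0, 'medium': 0.0, 'slow': 0.3642}  -> the slow track still protects Task 0
# bad  {'fast': 0.0, 'medium': 0.0, 'slow': 0.0052}  -> even the slow track is overwritten
```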
Customize & Explore
Try:
- Adjusting HOPE update rates
- Adding more tasks (Bus → Flight → Metro → Boat)
- Increasing vocabulary size
- Changing Transformer depth
- Lowering Task 1 epochs to reduce destructive updates
RNN vs Transformer vs HOPE (Quick Comparison)
| Model | Memory Type | Strengths | Weaknesses |
|---|---|---|---|
| RNN | One hidden state | Simple, sequential | Severe forgetting |
| Transformer | Shared parameter memory | Strong modeling power | High forgetting when fine-tuned |
| HOPE | Fast + Medium + Slow | Protects old tasks via slow memory | Needs tuning |
Why "Nested Learning"?
Traditional models update one memory system.
Nested Learning updates multiple memory systems simultaneously, each at a different speed:
- Fast → immediate adaptation
- Medium → short-term consolidation
- Slow → long-term stability
HOPE is a small but functional example of this idea.
Final Notes
This project is deliberately tiny. It is small enough to understand deeply, but powerful enough to illustrate the most important concepts in continual learning:
- Catastrophic forgetting
- Multi-timescale memory
- Nested Learning
- Transformer vs HOPE behavior
Use it as a learning tool, demo, or foundation for larger experiments.