After years of dominance by transformers, the AI community is actively searching for new architectures to overcome their limitations.
Transformers are central to models like OpenAI’s Sora, Anthropic’s Claude, Google’s Gemini, and GPT-4. But they face a basic computational challenge: self-attention compares every token with every other token, so compute and memory grow rapidly with the length of the input. Processing long sequences on standard hardware is therefore costly, and power demand climbs as companies scale their infrastructure.
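As a rough illustration of that scaling (a simplified sketch, not tied to any particular model), the number of pairwise attention scores a single self-attention head computes grows with the square of the sequence length:

```python
# Back-of-the-envelope illustration of quadratic attention cost.
# Each token attends to every other token, so one head over one
# sequence produces seq_len * seq_len attention scores.
def attention_matrix_entries(seq_len: int) -> int:
    """Number of pairwise attention scores for one head over one sequence."""
    return seq_len * seq_len

for n in (1_000, 10_000, 100_000):
    print(f"{n:,} tokens -> {attention_matrix_entries(n):,} attention scores")
# 1,000 tokens   ->          1,000,000
# 10,000 tokens  ->        100,000,000
# 100,000 tokens ->     10,000,000,000
```

A 100x longer input means roughly 10,000x more attention scores, which is why long-context processing drives up hardware and power costs.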
Recently, researchers from Stanford, UC San Diego, UC Berkeley, and Meta proposed a new architecture called test-time training (TTT). Developed over a year and a half, TTT models promise to process more data than transformers while using much less compute power.
| Feature | Transformers | TTT Models |
|---|---|---|
| Data processing efficiency | Inefficient on standard hardware for long inputs | Highly efficient |
| Power consumption | High | Low |
| Hidden state | Growing list of data (lookup table) | Replaced by an internal machine learning model |
| Scalability | Limited by computational demands | Scales without growing its internal state |
A fundamental part of transformers is the “hidden state,” a long list of data that grows as the model processes more input. Because it works like an ever-expanding lookup table, consulting it becomes increasingly demanding as sequences get longer. TTT models replace this hidden state with a small internal machine learning model that compresses incoming data into representative variables called weights, so the internal state stays the same size no matter how much data the model processes.
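The contrast can be sketched in a few lines of code. This is only a conceptual illustration under simplifying assumptions: the class names are hypothetical, and the single gradient step below is a toy stand-in for the self-supervised update the researchers describe, not their actual method.

```python
import numpy as np


class GrowingCache:
    """Transformer-style memory: keep every token around, like a lookup table."""
    def __init__(self):
        self.entries = []          # grows linearly with the input

    def update(self, token_vec):
        self.entries.append(token_vec)

    def size(self):
        return len(self.entries)   # the cost of consulting it grows with this


class TTTLinearState:
    """TTT-style memory: a small model whose weights absorb the data stream."""
    def __init__(self, dim, lr=0.01):
        self.W = np.zeros((dim, dim))   # fixed-size state, however long the input
        self.lr = lr

    def update(self, token_vec):
        # One toy self-supervised gradient step: nudge W toward reproducing
        # the token (a simplified stand-in for the paper's update rule).
        pred = self.W @ token_vec
        grad = np.outer(pred - token_vec, token_vec)
        self.W -= self.lr * grad

    def size(self):
        return self.W.size          # constant


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache, ttt = GrowingCache(), TTTLinearState(dim=16)
    for _ in range(10_000):        # stream 10,000 "tokens"
        x = rng.normal(size=16)
        cache.update(x)
        ttt.update(x)
    print("cache entries:", cache.size())   # 10000, and still growing
    print("TTT state size:", ttt.size())    # 256, fixed
```

The point of the sketch is the last two lines: the transformer-style cache keeps growing with the input, while the TTT-style state stays a constant size because new information is folded into its weights.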
According to Yu Sun, a post-doc at Stanford and co-contributor to the TTT research, future TTT models could process billions of pieces of data, from words to images to videos, far beyond the capabilities of current models.
“Our system can say X words about a book without the computational complexity of rereading the book X times,” Sun explained. He pointed to video models like Sora, which can process only about 10 seconds of video because of the limits of their lookup-table “brains.” The goal is a system that can process long video, approaching the visual experience of a human lifetime.
While TTT models show promise, they are not yet a drop-in replacement for transformers. The researchers have only developed two small models, making it difficult to compare TTT to larger transformer implementations.
Mike Cook, a senior lecturer at King’s College London’s department of informatics, noted the innovation but remained cautious. He pointed out that adding another neural network layer is a familiar approach in computer science but doesn’t necessarily guarantee better performance.
The quest for transformer alternatives is gaining momentum. AI startup Mistral released a model called Codestral Mamba, based on state space models (SSMs), which also promise computational efficiency and scalability. AI21 Labs and Cartesia are also exploring SSMs, with Cartesia pioneering some of the first models.
If successful, these efforts could make generative AI more accessible and widespread, impacting various sectors and everyday life.