Google Unveils New Underlying Architecture MoR: A Potential Transformer Replacement

07/25/2025

Preface: The unwieldy size and inefficiency of Large Language Models (LLMs) have long been a source of concern. Despite the proliferation of model parameters, issues such as degraded performance in long text processing and excessive computational resource consumption persist. Google DeepMind's newly proposed MoR architecture may offer a transformative solution to this dilemma.

Author | Fang Wensan

Image Source | Network

The Dilemma and Limitations of Traditional Models

For an extended period, the Transformer architecture has stood as the bedrock of large language models. However, as research advances, its inherent limitations have gradually come to light.

The Transformer enhances model capability by stacking network layers, and every token must pass through the full stack, so computational resources are distributed uniformly regardless of token complexity.

Both simple tokens (e.g., conjunctions and function words) and complex tokens (e.g., rare technical terms) therefore undergo identical processing, resulting in substantial redundant computation.

Furthermore, when handling long text sequences, its key-value (KV) caching mechanism demands substantial memory, further limiting the model's efficiency.
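
To make this memory pressure concrete, here is a back-of-the-envelope estimate of KV cache size for a hypothetical decoder configuration; all hyperparameters below are illustrative assumptions, not figures from the MoR paper.

```python
# Rough KV-cache size estimate for a standard Transformer decoder.
# All hyperparameters are illustrative assumptions, not values from any paper.

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Each layer caches a key tensor and a value tensor of shape
    # [batch, heads, seq_len, head_dim]; the factor of 2 covers K and V.
    return 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem

# Example: a mid-sized model serving a 32k-token context in fp16.
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=32_768)
print(f"~{size / 2**30:.1f} GiB per sequence")  # ~16.0 GiB
```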

In response, researchers are actively exploring two key avenues: enhancing parameter utilization efficiency through weight sharing mechanisms and dynamically allocating computational resources based on input complexity to achieve adaptive computing.

As model sizes balloon to hundreds of billions of parameters, training and inference costs have emerged as core bottlenecks hindering widespread adoption.

The traditional Transformer architecture's practice of applying uniform computation to all inputs thus entails significant resource redundancy.

From Theory to Practice: The Potential to Replace Transformer

Acknowledging Transformer's limitations, numerous non-Transformer architectures have emerged, including China's RWKV, Meta's Mega, Microsoft Research Asia's RetNet, Mamba, and DeepMind's Hawk and Griffin.

Most of these architectures build on recurrent (RNN-style) designs, aiming to address the Transformer's shortcomings with more efficient model structures.

Recently, teams from KAIST, Mila, and Google DeepMind introduced a groundbreaking new LLM architecture named Mixture-of-Recursions (MoR), hailed as a potential "Transformer killer" by the industry.

MoR combines parameter sharing and adaptive computation within a single framework, whereas earlier methods typically adopted only one of the two approaches.

This framework integrates a dynamic token-level routing mechanism into a parameter-efficient recursive Transformer, forming a cohesive architecture poised to "achieve the quality of large models while mitigating their costs."

In essence, the MoR framework dynamically and precisely allocates computational resources based on each token's needs, ensuring efficient task completion while minimizing resource waste.

MoR is a unified architecture built on recursive Transformers that dynamically adjusts the recursion depth for each token during both pre-training and inference.

The framework's core comprises two essential components: a lightweight routing mechanism and a KV caching strategy.

The routing component is a lightweight router, trained end to end with the rest of the model, that assigns a specific recursion depth to each token.

This lets the model decide how many times to recursively apply the shared parameter block to each token, according to how much processing that token requires, so computational resources are directed precisely where they are most needed.
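
The PyTorch-style sketch below illustrates the general idea of token-level recursion routing; the single shared block, the hard argmax routing rule, and all module names are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RecursionRoutingSketch(nn.Module):
    """Minimal sketch of token-level recursion routing (illustrative only)."""

    def __init__(self, d_model=512, n_heads=8, max_depth=4):
        super().__init__()
        # One parameter block shared across all recursion steps.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # Lightweight router: predicts how many recursion steps each token needs.
        self.router = nn.Linear(d_model, max_depth)
        self.max_depth = max_depth

    def forward(self, x):                            # x: [batch, seq, d_model]
        # Hard depth assignment in {1, ..., max_depth}; the real method trains the
        # router end to end with differentiable / auxiliary objectives instead.
        depth = self.router(x).argmax(dim=-1) + 1    # [batch, seq]
        h = x
        for step in range(1, self.max_depth + 1):
            active = (depth >= step).unsqueeze(-1)   # tokens still recursing
            # Apply the shared block; keep updates only for still-active tokens.
            # (A real implementation would gather active tokens to save compute.)
            h = torch.where(active, self.shared_block(h), h)
        return h
```

In this toy version every token passes through the shared block at least once, and tokens the router judges harder are refined for additional recursion steps.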

This token-level dynamic recursion mechanism naturally supports KV caching organized by recursion depth.

The cache can selectively store and retrieve corresponding key-value pairs based on the recursion depth assigned to each token, drastically reducing memory bandwidth pressure and enhancing inference throughput without post-processing.
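
As a rough illustration of the idea, the sketch below keeps a separate KV store per recursion depth, so only tokens routed deep enough ever write or read entries at the deeper levels; the data layout and method names are assumptions, not the paper's exact caching scheme.

```python
import torch

class RecursionWiseKVCache:
    """Illustrative KV cache keyed by recursion depth (a simplification of the idea)."""

    def __init__(self, max_depth):
        # One (keys, values, positions) store per recursion depth.
        self.store = {d: {"k": [], "v": [], "pos": []} for d in range(1, max_depth + 1)}

    def append(self, depth, position, k, v):
        # Only tokens that actually reach this depth write entries here, so
        # shallow-routed tokens never consume memory at deeper levels.
        entry = self.store[depth]
        entry["k"].append(k)
        entry["v"].append(v)
        entry["pos"].append(position)

    def get(self, depth):
        # Attention at a given recursion depth reads only the KV pairs cached at
        # that depth, shrinking both memory footprint and bandwidth.
        entry = self.store[depth]
        if not entry["k"]:
            return None, None, []
        return torch.stack(entry["k"]), torch.stack(entry["v"]), entry["pos"]
```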

In summary, MoR achieves three key optimizations within a unified architecture: parameter sharing, computational routing, and recursive-level caching.

Moreover, adopting a KV cache sharing strategy slightly reduces performance but significantly improves memory efficiency.

In deployment scenarios with limited memory resources, this trade-off between performance and resource consumption is deemed acceptable.

Taken together, these mechanisms allow the model to allocate computation precisely according to each token's processing needs, avoiding redundant expenditure.

At equivalent training compute and with a smaller model size, MoR substantially reduces validation perplexity, improves few-shot accuracy, and delivers higher throughput than existing models.

Its performance in tasks like few-shot learning and long text processing rivals that of Transformer but with superior computational efficiency, positioning it as a strong contender to replace the Transformer architecture.

Impressive Performance of MoR in Experimental Results

The research team tested multiple model sizes ranging from 135 million to 1.7 billion parameters.

Results indicate that, under the same training computation budget, models utilizing the MoR architecture, despite having nearly half the parameters of the baseline Transformer model, achieved an average accuracy of 43.1% in multiple few-shot learning tasks, outperforming the baseline model's 42.3%.

Crucially, MoR's higher computational efficiency enables it to process more training data within the same computation budget, further enhancing model performance.

In comparison experiments with a fixed amount of training data, a specific MoR configuration outperformed the baseline model using only 75% of the baseline's training computation, while also reducing training time by 19% and peak memory usage by 25%.

MoR's advantages are even more pronounced in inference performance.

Its continuous depth batching technique combines tokens at different computation stages into the same batch for processing, as they share the same parameter block.

This technique, coupled with the model's early exit mechanism, significantly boosts processing throughput.
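
The sketch below conveys the scheduling idea: because every recursion step reuses the same parameter block, tokens sitting at different recursion depths can be packed into one batch for a single pass, and tokens that finish early free their slot immediately. The queue structure and function names are assumptions made for illustration.

```python
from collections import deque

def continuous_depth_batching(pending, shared_block_step, batch_size=8):
    """Illustrative scheduler. `pending` holds (token_state, remaining_depth) pairs;
    `shared_block_step` runs one recursion step of the shared block on a list of states."""
    queue = deque(pending)
    finished = []
    while queue:
        # Fill a batch with whatever tokens are waiting, regardless of their depth.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        new_states = shared_block_step([state for state, _ in batch])
        for state, (_, remaining) in zip(new_states, batch):
            if remaining <= 1:
                finished.append(state)                 # early exit: done recursing
            else:
                queue.append((state, remaining - 1))   # needs more recursion steps
    return finished
```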

In a test of a 360 million-parameter model, the MoR-4 configuration achieved up to 2.06x inference acceleration under specific test conditions.

Notably, despite having nearly 50% fewer parameters, MoR still exhibited superior performance.

This advantage stems from its substantially improved computational efficiency, enabling it to process more training tokens within the same FLOPs budget.
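
As a toy illustration of this accounting (all numbers below are made-up assumptions, not figures from the paper): if routing lowers the average number of recursion steps a token receives, the per-token cost falls, so a fixed FLOPs budget covers more training tokens.

```python
# Toy FLOPs accounting under illustrative assumptions (not figures from the paper).
flops_budget = 1e21            # fixed training compute budget
flops_per_layer_token = 1e8    # cost of one layer (or one recursion step) per token

baseline_layers = 32           # every token goes through all layers
mor_avg_recursions = 20        # hypothetical average depth after routing

tokens_baseline = flops_budget / (baseline_layers * flops_per_layer_token)
tokens_mor = flops_budget / (mor_avg_recursions * flops_per_layer_token)
print(f"{tokens_mor / tokens_baseline:.1f}x more tokens under the same budget")  # ~1.6x
```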

MoR Represents a Fundamental Shift in LLM Development Logic

The emergence of MoR signifies a paradigm shift in AI models, evolving from "scale expansion" to "intelligent computing".

Its dynamic routing mechanism mimics the human cognitive trait of "selective attention", offering new insights into developing more biologically inspired AI systems.

Through the triple optimization mechanism of dynamic routing, parameter sharing, and intelligent caching, MoR redefines the efficiency boundaries of large models.

The groundbreaking advancements of doubling inference speed and halving memory usage not only significantly reduce deployment costs but also establish a new paradigm for tackling complex tasks.

While further exploration is needed in large-scale validation and multimodal extension, MoR has demonstrated substantial potential to replace Transformer, potentially leading the architectural innovation of the next generation of AI models.

Crucially, MoR lays the groundwork for developing more cognitively inspired AI systems.

The framework's ability to adaptively allocate "thinking depth" per token during generation aligns well with emerging research on the potential reasoning and internal thinking mechanisms of language models.

This suggests that MoR can serve as a key platform for exploring how models can gradually learn to delve deeper into complex problems while maintaining efficiency in routine tasks.

Conclusion:

MoR continues and deepens the exploration of AI efficiency optimization, transitioning from single-dimensional optimization to multi-dimensional synergistic optimization of parameters, computation, and memory.

This holds significant practical value for reducing the deployment and application costs of large language models.

Overall, while it is premature to assert that MoR will fully replace the Transformer architecture at this juncture, it undeniably provides a promising evolutionary direction for future language model design in terms of both performance and efficiency.

References:

Yan Yan Planet: "Google DeepMind Releases MoR Architecture, Doubles Inference Speed, Halves Memory, Potential Transformer Replacement"

Suanjia Cloud: "End of Transformer Hegemony? Google DeepMind Introduces Revolutionary Architecture: 2x Inference Speed, 50% Fewer Parameters"

AINLPer: "Google et al. Propose Recursive Mixture Framework: MoR, Significantly Boosting LLM Computational Efficiency"

AI Empire: "Google Releases MoR Architecture: 2x Inference Speed, 50% Memory Savings"
