08/25 2025
Author|Mao Xinru
In a single leap, Xingdong Jiyuan's L7 set a new world record for humanoid robot high jump at 95.641 cm.
The robot stands 171 cm tall and weighs 65 kg, and even an average person might struggle to match such a high, precise, Mario-style jump.
Although the many "rollover" mishaps at this year's World Humanoid Robot Games drew most of the attention, events such as running, high jump, and long jump rigorously tested how well each robot's algorithms and hardware work together.
Meanwhile, Unitree Technology, which dominated this year's games, saw its founder Wang Xingxing's speech at the World Robot Conference Forum labeled as "explosive" or even "radical" for questioning the currently popular VLA (Vision-Language-Action) approach.
Chen Jianyu, the founder of Xingdong Jiyuan, another champion team, takes a different view of VLA from Wang Xingxing.
These differing viewpoints reflect the two companies' distinct practical paths towards enhancing robot capabilities: one emphasizes "hardware first," the other favors "software-hardware integration and vertical integration."
Divergence in Vertical Integration and Hardware-First Concepts
The founders' different backgrounds somewhat foreshadowed the different development directions for the two companies.
Wang Xingxing embodies a typical engineer's mindset, leading Unitree Technology to adopt a "hardware first" approach. On the other hand, Chen Jianyu, a professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, has a more scientific perspective, prompting Xingdong Jiyuan to choose a vertical integration route that integrates software and hardware.
The most significant divergence in their views lies in their assessment of the feasibility of VLA.
Chen Jianyu sees VLA as a broad paradigm encompassing any model that integrates vision, language, and action for execution in the physical world.
He believes that with the integration of generative world models and reinforcement learning, the capabilities of end-to-end methods are gradually being proven.
Thus, Xingdong Jiyuan continues to invest in integrated research and development of software and hardware, end-to-end VLA, reinforcement learning, and world models, releasing the end-to-end native robot large model ERA-42 last year.
Wang Xingxing, however, is skeptical of the currently popular VLA approach in robotics and prefers to allocate more resources to the "world model/video-driven" route.
He believes that a model that merely concatenates vision, language, and action without a stable world representation and prediction capability will reveal shortcomings when interacting with the real world, such as excessive dependence on data quality and diversity, and insufficient long-term planning and causal reasoning abilities.
Furthermore, the two companies have distinct differences in their weighting of "model-data-hardware."
Chen Jianyu insists that the model architecture comes first, but the diversity and quality of data, as well as hardware design, are equally crucial, together determining the upper limit of robot performance. Therefore, Xingdong Jiyuan adopts a software-hardware integration approach, advancing simultaneously.
He also views how to achieve training goals with less real machine data as an important engineering problem and has designed a two-stage training strategy combining a data pyramid with "pre-training + real machine fine-tuning."
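The two-stage idea can be illustrated with a toy example. The sketch below is not Xingdong Jiyuan's pipeline. It assumes a one-parameter model, a large synthetic "simulation" set standing in for the broad base of a data pyramid, and two "real machine" samples at its tip. Every name and number here (`sgd_fit`, the slopes 2.0 and 2.2, the learning rates) is hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def sgd_fit(w: float, data: List[Sample], lr: float, epochs: int) -> float:
    """Fit y = w * x with plain SGD on squared error, starting from w."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


# Stage 1 data: plentiful but slightly "wrong" (assumed slope 2.0).
sim_data = [(i / 100.0, 2.0 * i / 100.0) for i in range(1, 101)]
# Stage 2 data: scarce but from the target system (assumed slope 2.2).
real_data = [(0.5, 1.1), (1.0, 2.2)]

# Stage 1: pre-train on the large, cheap dataset.
w_pre = sgd_fit(0.0, sim_data, lr=0.1, epochs=5)
# Stage 2: fine-tune on the small real dataset with a gentler learning rate.
w_ft = sgd_fit(w_pre, real_data, lr=0.05, epochs=20)
# Baseline: the same real-data budget, but starting from scratch.
w_scratch = sgd_fit(0.0, real_data, lr=0.05, epochs=20)

print(f"pre-trained: {w_pre:.3f}, fine-tuned: {w_ft:.3f}, scratch: {w_scratch:.3f}")
```

With the same small real-data budget, the fine-tuned weight ends closer to the real slope than the from-scratch baseline, which is the intuition behind using pre-training to reduce the amount of real machine data required.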
Wang Xingxing emphasizes that the model determines the data, meaning that the model paradigm must be clarified first to avoid wasting resources on ineffective data collection or hardware. He believes that model design is still a bottleneck at this stage, and insufficient model capabilities can lead to a blind pursuit of data volume or variety.
Similarly, Chen Jianyu and Wang Xingxing have differing focuses regarding "open source and ecosystem."
Chen Jianyu values the synergy brought by the open-source ecosystem. Xingdong Jiyuan has open-sourced achievements such as the Humanoid Gym reinforcement learning framework for humanoid robots and the generative large model VPP, believing that open source can drive ecosystem prosperity and benefit from community iterations.
Wang Xingxing focuses more on building reusable data and model resources, as well as engineering implementations for large-scale distributed computing power, which is more about "how to make models replicable across multiple robots and scenarios."
Finally, regarding the pace of commercialization, the two have different judgments on short-term implementation.
Chen Jianyu prefers a path that prioritizes B-end scenarios, gradually transitioning to household scenarios. He revealed that Xingdong Jiyuan has already deployed in some real industrial scenarios, achieving over 70% of human efficiency and expecting to reach around 90% next year.
This choice is based on a pragmatic consideration of technology maturity and market acceptance, aligning with the iterative needs of software-hardware integration technology.
Wang Xingxing, on the other hand, has adopted a more diversified commercialization strategy. He openly admits that Unitree Technology's robots are currently mainly used for performances and fighting competitions because their capabilities for practical work are indeed not yet up to par.
This choice is based on a clear understanding of the current stage of technological development. Since robots are temporarily unable to handle complex practical work, it is better to accumulate technology, funds, and market attention in entertainment and demonstration scenarios, waiting for the arrival of a technological inflection point.
It is worth mentioning that the two companies are also at different stages in their commercial development.
Unitree is on the brink of an IPO and needs a "small steps, fast iteration" strategy to keep its finances on solid footing. For example, Unitree recently launched the new humanoid robot R1 and the quadruped robot dog A2, and also previewed a full-size humanoid robot.
Xingdong Jiyuan, on the other hand, has built a full-stack system this year encompassing "humanoid robots, service robots, dexterous hands, and large robot models."
An End-to-End Closed-Loop Architecture Distinct from the "Unitree Model"
Unlike Unitree's "hardware first" approach, Xingdong Jiyuan follows a route of "software-hardware integration, end-to-end VLA + reinforcement learning + world model fusion."
The Unitree model emphasizes the core position of self-developed hardware, laying a high-performance foundation for the robot's body by increasing joint motor torque and optimizing mechanical structures.
Xingdong Jiyuan, on the other hand, tends to view hardware and software as an integrated system, believing that only through bidirectional deep coupling can the maximum potential of humanoid robots be unleashed in complex environments.
In fact, the nature of humanoid robot operations has already determined the need for coupled software and hardware development.
To complete tasks such as grasping, lifting, and walking in complex and dynamic real-world environments, humanoid robots rely on both complex perception and high-bandwidth motion execution.
Focusing solely on the "brain" or the "body" is insufficient to form a feasible closed loop. Only by closing the engineering link of "perception-decision-execution" and continuously iterating can stable performance be maintained in the complex real world.
End-to-end instant feedback and high-frequency control also offer significant advantages. The traditional phased "perception-planning-control" architecture suffers from stage delays and information loss, making it difficult to achieve a human-like "see-and-act immediately, real-time correction" feedback loop.
The end-to-end strategy can couple "vision-language-action" within a single learning entity, enabling robots to quickly adapt to sudden disturbances, especially in tasks requiring high-frequency, fine-grained actions.
From a commercialization perspective, companies that fully bet on "building the brain first" will face lengthy delivery cycles. In contrast, adopting a software-hardware integration and parallel advancement strategy allows for valuable data and engineering feedback to be obtained from customers and implementation scenarios, which in turn benefits model iteration.
Xingdong Jiyuan has already achieved implementation in the domestic B-end market, verifying its products through real-world scenario data and forming a technological closed loop, thus balancing research progress and commercial needs.
In turning abstract theory into engineering practice, Xingdong Jiyuan has built a five-layer technical system, from bottom to top:
The hardware layer includes self-developed joint modules, direct-drive motors, reducers, dexterous hands, etc. Taking the high jump champion L7 as an example, its joint peak torque, rotation speed, and degrees of freedom all reflect a design orientation for high dynamic movements. Self-developed hardware not only aims for high performance but also for obtaining higher-quality, repeatable real machine training data.
The real-time control layer encompasses low-latency drive, joint-level high-frequency controllers, and online dynamics solving modules.
At this layer, Xingdong Jiyuan relies on traditional control theory for stability assurance while integrating reinforcement learning to learn high-dimensional motion strategies. This hybrid strategy expands the upper limit of learnable actions while ensuring safe and stable robot movement.
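One common way to combine a classical controller with a learned policy is a residual scheme: a PD loop supplies a stable baseline torque, and the policy adds a bounded correction. The sketch below assumes that scheme purely for illustration; the source does not say this is how Xingdong Jiyuan's controller is built, and `PDGains`, `hybrid_torque`, and all limits are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float], Sequence[float], Sequence[float]], Sequence[float]]


@dataclass
class PDGains:
    kp: float  # proportional gain on position error
    kd: float  # damping gain on joint velocity


def clip(value: float, limit: float) -> float:
    """Clamp value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def hybrid_torque(
    q: Sequence[float],
    dq: Sequence[float],
    q_target: Sequence[float],
    gains: PDGains,
    policy: Policy,
    residual_limit: float,
    torque_limit: float,
) -> List[float]:
    """Per-joint torque = PD baseline + clipped learned residual, then clipped again."""
    residual = policy(q, dq, q_target)
    torques = []
    for qi, dqi, qt, ri in zip(q, dq, q_target, residual):
        pd = gains.kp * (qt - qi) - gains.kd * dqi  # classical stabilizing term
        learned = clip(ri, residual_limit)          # bound what the policy may add
        torques.append(clip(pd + learned, torque_limit))
    return torques


def zero_policy(q, dq, q_target):
    """Stand-in for a trained policy that contributes nothing."""
    return [0.0 for _ in q]


def aggressive_policy(q, dq, q_target):
    """Stand-in for a policy that requests an unreasonably large correction."""
    return [100.0 for _ in q]


gains = PDGains(kp=10.0, kd=1.0)
baseline = hybrid_torque([0.0], [0.0], [1.0], gains, zero_policy, 5.0, 50.0)
bounded = hybrid_torque([0.0], [0.0], [1.0], gains, aggressive_policy, 5.0, 50.0)
```

Clipping the residual keeps the classical term in charge of stability, while the policy is free to shape the motion within a safe envelope.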
The perception and world model layer integrates multimodal perceptions such as vision, touch, and depth, and runs a generative world model. Currently, Xingdong Jiyuan is experimenting with combining generative models with world models for future prediction, cognition, and behavior generation, essentially using models to imagine the future and generate actions accordingly.
The layer that combines the end-to-end VLA large model ERA-42 with high-level decision-making integrates vision, language, and action into an end-to-end strategy, covering the closed loop from scene understanding and task parsing to action output.
High-level strategies can leverage pre-trained visual-language models and generative modules, fine-tuned through reinforcement learning on real machines to achieve task specialization.
The data engineering and training platform layer comprises simulation environments, data annotation, and distributed training clusters.
These layers are chained together, forming a "closed-loop accelerator" from hardware to models, from simulation to real machines, and from open source to commercial scenarios.
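The idea of "using models to imagine the future and generate actions accordingly," described for the perception and world model layer, can be sketched with random-shooting planning: sample candidate action sequences, roll each one forward through a world model, and keep the sequence whose imagined outcome lands closest to the goal. This is a deliberately simple stand-in, not the generative world model the article refers to. `toy_world_model`, `plan_by_imagination`, and the one-dimensional state are hypothetical.

```python
import random
from typing import Callable, List, Sequence

WorldModel = Callable[[float, float], float]  # (state, action) -> predicted next state


def toy_world_model(state: float, action: float) -> float:
    """Assumed dynamics: each action nudges a 1-D state by a tenth of its value."""
    return state + 0.1 * action


def rollout(state: float, plan: Sequence[float], world_model: WorldModel) -> float:
    """Imagine the final state reached by applying plan step by step."""
    for action in plan:
        state = world_model(state, action)
    return state


def plan_by_imagination(
    state: float,
    goal: float,
    world_model: WorldModel,
    horizon: int = 5,
    n_candidates: int = 64,
    seed: int = 0,
) -> List[float]:
    """Return the sampled action sequence whose imagined end state is nearest the goal."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    best_cost, best_plan = float("inf"), []
    for _ in range(n_candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = abs(goal - rollout(state, plan, world_model))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan


plan = plan_by_imagination(state=0.0, goal=0.3, world_model=toy_world_model)
imagined_end = rollout(0.0, plan, toy_world_model)
print(f"imagined end state: {imagined_end:.3f}")
```

In practice only the first action of the chosen plan would typically be executed before replanning from the newly observed state.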
With the support of this system, Xingdong Jiyuan has achieved phased engineering results:
L7's victory in the high jump is a testament to its engineering capabilities in "dynamic design, joint control, and algorithm integration."
The dexterous hand has entered a stable mass production phase, with significant improvements in cost and stability. Integrated with VLA control, it achieves high-frequency, fine-grained finger control, transitioning from a laboratory prototype to an industrial-grade product.
Papers on reinforcement learning for motion control, world model fusion, and generative VLA have been published, and projects such as Humanoid Gym and VPP have been open-sourced, promoting industry collaboration.
Commercial validation has been completed in industrial scenarios such as warehousing, handling, inspection, and entertainment demonstrations. Over 300 products have been delivered this year, with hundreds more orders in mass production. Nine of the top ten technology companies globally are its customers.
Xingdong Jiyuan's closed-loop system and its achievements demonstrate one possibility for technology implementation. This is also a microcosm of the current industry's "hundred schools of thought."
A Shared Belief Behind the Differences
When we widen the view from a single company's practice to the entire industry, it becomes apparent that behind Chen Jianyu's and Wang Xingxing's seemingly different choices lies a shared belief in the future of humanoid robots.
Despite their obvious differences in technical paths and business strategies, Chen Jianyu and Wang Xingxing share a high degree of consensus on some fundamental issues.
They both believe that humanoid robots are one of the ultimate carriers of AI technology, capable of profoundly influencing human society's production and lifestyle.
Wang Xingxing predicts that the "ChatGPT moment" for humanoid robots is approaching, possibly within one to two years or up to five years. By then, robots will be able to understand and execute various complex instructions in a completely unfamiliar environment.
Chen Jianyu also agrees with the gradual development path from machine workers to household companions, believing that "the ultimate killer application will definitely be in the home."
In terms of comprehending the core of technology, they share a common view: embodied intelligence is fundamentally a closed loop of "perception-decision-execution," rather than merely a breakthrough in software or hardware.
Wang Xingxing emphasizes that the pivotal aspect of robots lies in AI, not the physical body itself. This does not diminish the importance of hardware; rather, it highlights that intelligence levels are currently the primary constraint.
Through his experience with software-hardware integration, Chen Jianyu has demonstrated that the limitations of hardware performance are crucial in determining intelligent capabilities. Only highly dexterous hands can manage complex manipulation tasks, and only robust motion capabilities can support a wide array of functions.
Both men have independently converged on the significance of hardware-software collaboration. While Chen Jianyu underscores software's dominance, he acknowledges that hardware performance sets the upper limit of model performance. Wang Xingxing, prioritizing hardware, is also actively integrating large models to enhance robots' autonomous decision-making abilities.
Behind these agreements lies the industry's consensus that "robotics is a systems engineering effort." Without advanced models, hardware remains mere precision machinery; without reliable hardware, models are confined to laboratory algorithms.
From the current technological landscape to the ideal of general embodied intelligence, the humanoid robot industry must still navigate multiple developmental hurdles. Through the insights of Chen Jianyu and Wang Xingxing, we can envisage a possible path for the industry's future:
Short-term (1-3 years): Various technological approaches, such as end-to-end VLA, world models, and video generation, will evolve in parallel, learning and integrating from each other. Leading enterprises will achieve small-scale deployments in specific industrial scenarios to validate commercial viability.
Medium-term (3-5 years): A "ChatGPT moment" may emerge, with technological advancements enhancing general capabilities. The industry will gradually establish unified technical standards, and application scenarios will expand from industry to multiple commercial fields like logistics, healthcare, and retail.
Long-term (5-10 years): Humanoid robots are expected to enter households as "household partners," but issues like safety, reliability, and natural interaction need resolution. Continuous technological breakthroughs and iterations will be necessary.
In fact, Chen Jianyu views world models as a significant evolutionary direction under the VLA paradigm, and Wang Xingxing does not entirely dismiss the value of the end-to-end approach.
In the future, multiple technological paths may converge during the industry's development. End-to-end VLA models could incorporate world models' prediction and reasoning capabilities to enhance performance in unfamiliar environments. Similarly, world models might draw from VLA's architectural design to improve real-time interaction capabilities.
No two leaves are alike, and in this embodied intelligence competition, no enterprise adopts identical tactics, paradigms, or engineering focuses.
Amidst a period of paradigm convergence, diversity is crucial in guiding the industry towards the correct path.
When practical testing and paradigm reflection run in tandem and validate each other, the industry can achieve rapid progress without prematurely falling into technological rigidity.
From technological accumulation to industrialization, from being a fleeting competition winner to becoming a reliable household assistant, this journey is bound to be lengthy and fraught with uncertainties.
The coming years may prove to be the most dramatic and pivotal stage. Only by continually reflecting on underlying paradigms and maintaining a "technology-business" closed loop can these forces combine to transform humanoid robots into sustainable productivity.