Wang Xingxing: Are AI Models the Biggest Problem for Robots? Why Are Large Models Insufficient?

09/15/2025

In recent years, alongside the rapid development of artificial intelligence and robotics, the idea that AI will drive robots forward has become a near-consensus. However, Wang Xingxing, founder of the renowned robotics company Unitree Technology, recently stated that the biggest problem with robots today is still the AI models themselves. What is going on here? Why are today's thriving large models still not enough?

I. Wang Xingxing: Are AI Models the Biggest Problem for Robots?

According to a report by The Paper, during the 2025 Bund Summit roundtable discussion, Wang Xingxing, founder and CEO of Unitree Technology, stated that in the field of robotics, hardware and the 'brain' (AI) are not on the same level. At present, robot hardware is entirely sufficient and 'can be used for one or two years without issue.' The biggest problem remains the insufficient capabilities of AI large models, particularly in multimodal fusion, where performance is still unsatisfactory.

Wang Xingxing noted that while pure language models and pure video models already deliver excellent results, effectively combining language and images remains a significant challenge. In robotics, models still cannot make full use of the hardware; controlling a robot's dexterous hands with a model, for example, remains difficult. He mentioned that although AI performs exceptionally well in information processing, text, and image applications, the field of getting AI to 'do real work' remains a 'desert with only a few sprouts,' and explosive growth has not yet arrived.

'This is an incredibly friendly era for young people. The AI era is a highly equitable one,' Wang Xingxing said. He believes that young people can use AI models to learn programming on their own, and he encouraged them to embrace AI more aggressively: not merely to treat models as tools, but to relearn and absorb them as versatile instruments, and ultimately put them to better use.

Wang Xingxing is not alone in this view. A popular online joke goes, 'I thought AI would help me with laundry and dishes so I could pursue art and creativity. Instead, AI is pursuing art and creativity while I'm stuck with laundry and dishes.'

II. Why Are Large Models Insufficient?

With the rapid advancement of artificial intelligence technology, robots have become a crucial component of modern technology. However, despite the swift development of large models, their practical performance remains unsatisfactory, particularly in robotics. This is the root of Wang Xingxing's statement. How should we interpret this?

Firstly, while large models have developed rapidly, most are still in their early stages. In recent years, the field of large models has seen tremendous activity, with numerous tech giants and research teams investing heavily in development. From simple early models to today's massive-scale, increasingly complex large models, the progress is astounding. However, we must recognize that most large models still operate at the level of logical reasoning. They can analyze and reason to some extent based on input information, producing seemingly reasonable outputs. Yet, this reasoning is largely based on existing data and preset rules, lacking true understanding and innovation.

Take large models in natural language processing as an example. They can generate fluent text and answer all kinds of questions, but they often falter when deep or complex semantic understanding is required. For instance, they may struggle with subtle metaphors, puns, or culturally context-dependent expressions. Large models still have a long way to go in grasping the rich connotations and nuances of human language. Moreover, large models are still in their early stages of development, requiring continuous training and optimization. Each round of training demands vast amounts of data and powerful computational resources, which are costly and fraught with uncertainties. Thus, from an overall development perspective, large models are still far from mature.

Secondly, while robot hardware now meets demands, the thinking patterns of large models differ significantly from humans. Robot hardware has seen remarkable advancements in recent years. Various advanced sensors, actuators, and mechanical structures have equipped robots with powerful environmental perception and motion control capabilities. For example, some industrial robots can precisely perform complex assembly tasks, while service robots can autonomously navigate and avoid obstacles indoors. However, hardware progress has not fully translated into improved robot intelligence. The key issue lies in the significant differences between the thinking patterns of large models and human cognition.

When humans solve problems, they often rely on intuition, experience, and creativity to make quick judgments and decisions. Simple tasks, such as identifying an object's purpose or understanding a scene's atmosphere, may come instinctively to humans. For large models, however, these tasks require considerable training time. Take image recognition as an example. While large models achieve high accuracy in identifying common objects, they may need extensive labeled data to recognize uncommon or symbolically significant images. Furthermore, large models typically rely on statistical patterns and matching, lacking an understanding of the essence of things. These differences in thinking patterns often leave large models struggling when faced with complex, ever-changing real-world scenarios.

Thirdly, large models can currently only replace basic tasks and falter when confronted with high-difficulty challenges. From their practical applications, large models excel at replacing and handling numerous foundational, repetitive, and rule-based tasks. For example, in customer service, they efficiently handle standardized Q&A; in content creation, they generate news articles, marketing copy, and other formatted texts; in industrial automation, they perform preset assembly and inspection tasks. However, their performance drops sharply when tasks become complex, involving multi-step reasoning, cross-domain knowledge integration, or dynamic environmental adaptation.

Take household service robots as an example. While they can easily follow voice commands like 'play music' or 'turn the lights on/off,' many real-life scenarios are ambiguous. For instance, a command like 'find the package I received yesterday, which might be on the shoe rack by the door or under the sofa' poses a significant challenge. The robot must understand time, objects, spatial locations, and possess visual search, object recognition, path planning, and interactive feedback capabilities—a massive challenge for current large models. Thus, large models remain in the 'tool' stage rather than the 'intelligent agent' stage, with clear limitations in handling the ambiguity and uncertainty prevalent in the real world.
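To make the scale of that challenge concrete, here is a minimal Python sketch of how such an ambiguous command might be decomposed into sub-tasks. Everything here (the SubTask class, decompose_command, the hard-coded plan) is a hypothetical illustration of the capabilities one casual sentence implies, not any real robot framework's API; in a real system, producing this plan is precisely the job the large model currently struggles with.

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    """One capability the robot must invoke to satisfy the command."""
    capability: str      # e.g. "visual_search", "path_planning"
    arguments: dict = field(default_factory=dict)

def decompose_command(command: str) -> list[SubTask]:
    """Hypothetical planner for the ambiguous request 'find the package
    I received yesterday, which might be on the shoe rack by the door
    or under the sofa'.

    The plan is hard-coded purely to show how many distinct capabilities
    one everyday sentence demands of the robot's 'brain'.
    """
    return [
        # Temporal reasoning: "yesterday" must be resolved to a date range.
        SubTask("temporal_grounding", {"expression": "yesterday"}),
        # Object grounding: what does "the package" look like?
        SubTask("object_grounding", {"category": "package"}),
        # Spatial reasoning: candidate locations named by the user.
        SubTask("path_planning", {"waypoints": ["shoe rack by door", "under sofa"]}),
        # Perception: search each candidate location visually.
        SubTask("visual_search", {"target": "package", "strategy": "check_each_waypoint"}),
        # Interaction: report the result, or ask for help if the search fails.
        SubTask("interactive_feedback", {"on_failure": "ask_user_for_more_hints"}),
    ]

if __name__ == "__main__":
    plan = decompose_command(
        "find the package I received yesterday, which might be on the "
        "shoe rack by the door or under the sofa"
    )
    for i, task in enumerate(plan, 1):
        print(f"{i}. {task.capability}: {task.arguments}")
```

Even this toy decomposition makes the point: a single vague sentence fans out into temporal, visual, spatial, and interactive sub-problems, each of which today's models handle only unreliably.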

Fourthly, embodied intelligence still has a long way to go in building a practical 'brain.' As a vital branch of artificial intelligence, embodied intelligence aims to equip robots with physical perception and action capabilities, enabling them to autonomously complete tasks in real environments. Today, more and more tool-like robots are entering the market, capable of performing specific operations in defined scenarios, such as moving goods or cleaning floors.

However, creating robots that can truly work like humans remains an enormous challenge. Take household chores as an example. A competent homemaker not only cleans rooms, cooks meals, and does laundry but also arranges daily affairs based on family members' habits and preferences, reacting swiftly to unexpected situations. This places extremely high demands on a robot's large model, requiring comprehensive life knowledge, emotional understanding, and social communication skills.

Currently, while some robots have learned performative actions like dancing, they are still far from genuine household labor or assistant roles. To enable robots to truly integrate into human life and become capable helpers, we must develop a highly advanced 'brain' that meets practical work demands—a feat that undoubtedly requires extensive large model training and practical experience.

Fifthly, where should the future of artificial intelligence head? For large models, simply increasing parameters in a low-quality manner is no longer meaningful. While expanding model parameter scales has somewhat improved performance, it has also introduced numerous issues, such as high training costs, slow inference speeds, and poor interpretability. Moreover, merely pursuing larger parameter scales does not fundamentally address the challenges large models face in embodied intelligence applications.

The top priority for large model evolution is to genuinely support the implementation of embodied intelligence. This requires efforts on multiple fronts. On one hand, we must optimize training methods and algorithms for large models, improving efficiency and quality to achieve better performance with less data and fewer computational resources. On the other hand, we need to strengthen the deep integration of large models with robot hardware to achieve hardware-software collaborative optimization. By feeding real-time sensor data from robots back to large models, we enable better environmental perception and task understanding, leading to more accurate decisions and actions.
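A minimal sketch of that feedback loop, in Python, might look like the following. The three stub functions (read_sensors, query_model, execute) are placeholders I am assuming for illustration; a real system would wrap actual sensor drivers, a served model, and motor controllers behind them.

```python
import time

def read_sensors() -> dict:
    """Stub: collect the robot's current multimodal observations."""
    return {"camera": "rgb_frame", "joints": [0.0] * 12, "timestamp": time.time()}

def query_model(observation: dict, goal: str) -> dict:
    """Stub: ask the large model for the next action, conditioned on
    fresh sensor data rather than text alone (the 'feedback' step)."""
    return {"action": "move_arm", "params": {"target": "cup"}, "goal": goal}

def execute(action: dict) -> None:
    """Stub: send the chosen action to the actuators."""
    print(f"executing {action['action']} with {action['params']}")

def control_loop(goal: str, steps: int = 3, hz: float = 1.0) -> None:
    """Closed perception-decision-action loop: each cycle feeds real-time
    sensor data back into the model so decisions track the environment."""
    for _ in range(steps):
        obs = read_sensors()             # perceive
        action = query_model(obs, goal)  # decide
        execute(action)                  # act
        time.sleep(1.0 / hz)             # hold the loop rate

control_loop("pick up the cup")
```

The design point is the loop itself: the model is queried with fresh observations on every cycle instead of planning once from text, which is what "hardware-software collaborative optimization" demands of future models.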

Therefore, the answer to Wang Xingxing's question is undoubtedly yes. The 'insufficiency' of large models does not stem from a lack of quantity but rather from the need for a qualitative leap in intelligence depth and practicality—this is where large models must focus their efforts.
