04/08 2026

By Guo Jing
Source / Node AI
If we simply liken artificial intelligence to endowing machines with human-like intelligence, then judging by the development of today's large language models, AI has the potential to replace many entry-level professionals in fields such as programming, image generation, and text processing.
However, it is premature to declare that 'AI is replacing humans' just yet.
Humans possess spatial awareness, smell, touch, and other sensory capabilities; today's large AI models still face significant challenges in truly understanding the physical world. Bridging this gap is what industry figures such as Jensen Huang and Fei-Fei Li regard as the necessary route to Artificial General Intelligence (AGI).
So, how can we bridge this gap?
At the Boao Forum for Asia, Hu Baishan, President and Chief Operating Officer of vivo, offered a clear perspective: before advanced physical large models emerge, achieving a superior user experience requires converting information from the physical world into the digital realm.
He believes that smartphones—devices he considers indispensable for the next decade—should be the tools for this transformation.
Physical AI remains largely uncharted territory. Before its true implementation, any curiosity and imagination deserve exploration. Let's delve into the insights of this industry veteran: what he envisions, what he's betting on, and the likelihood of success in this endeavor.
Perception is Key in the AI Era

The past two years have seen a flurry of industry trends, including AI smartphones, edge models, and embodied intelligence, leaving many feeling overwhelmed. Fearful of missing out, smartphone brands have rapidly expanded their AI smartphone offerings. Initially, the industry believed that model capabilities would become a key differentiator for smartphone manufacturers.
Hu Baishan disagrees. He points out that, compared to models, accumulated scenario data holds the greatest potential for differentiation.
Simply put, it's about AI's ability to perceive and understand specific physical scenarios.
At the Boao Forum, Hu Baishan used a vivid metaphor: without perception, AI is like a master confined to a dark room—no matter how capable, it cannot perceive the world just inches away.
His reasoning is sound: future models will likely become increasingly similar, with open-source speeds accelerating and the gaps between them narrowing.
Upon closer inspection, this trend is evident. DeepSeek's open-source large models set a precedent last year. A year later, the landscape has changed, with companies like Zhipu, MiniMax, and Kimi catching up. If smartphone companies merely add AI capabilities, the differences between them will become negligible.
Hu Baishan believes that vivo's differentiation can lie in perception.
What does perception entail? It's not just about touch or smell. vivo interprets it as understanding light and shadow, spatial relationships, the dynamics of a scene, and even human emotional states.
As a proactive step, he announced at Boao an internal decision recently finalized at vivo: this year, vivo officially established a new long-term technology track—the Perception Track.
However, physical AI is still in its infancy, with no existing open-source solutions in the industry to draw from, making true implementation a daunting task. Hu Baishan himself admits: this field lacks open-source resources and requires independent exploration.
Choosing the right direction doesn't guarantee an easy path. Next, we must examine how vivo plans to acquire these perceptual capabilities.
What Is the Lever for Perception?

Hu Baishan believes that enabling AI to interact with the physical world requires a robust perception system. For vivo, the core focus for training this system is imaging.
How exactly? Piecing together Hu Baishan's speeches and interviews, the approach can be summarized as a synergy: hardware collects the data, software transforms it into perceptual data, and together they form a data moat.
First, hardware.
When people think of AI pre-training data, they often consider images or text corpora. However, data for embodied intelligence is different: machines must learn human behavior in the real world. A typical scenario involves humans performing actions while machines observe and collect data.
vivo relies on imaging as its 'eyes' for data collection.
Take the X300 Ultra: vivo has upgraded its main sensor to a 1/1.12-inch format, and its collaboration with Sony centers on semiconductor conversion efficiency. Hu Baishan, for example, cited a new technological path said to push the light conversion rate of photosensitive elements from 90% to over 110%.
Hu Baishan's judgment aligns with industry observers: sensor sizes have reached a point of diminishing returns, with greater potential in conversion efficiency and external form factors. The X300 Ultra already features 200mm and 400mm fixed-focus extenders, with more upgrades on the horizon—continuous hardware enhancements that help vivo better understand its users.
But merely 'seeing' isn't enough; machines must also 'understand.'
Now, software.
vivo has deployed multiple specialized agents on the device side. One agent judges what you're photographing, which focal length to use, and the lighting conditions; another organizes your photo album, recommends filters based on editing habits, and even automatically edits footage into short videos.
Data privacy is a natural concern here, but vivo relies on on-device AI rather than cloud AI, which offers low latency, strong privacy, and little dependence on the network. Over time, this yields data that more closely matches real user scenarios, building the differentiation described above.
In summary, vivo aims to convert visual, auditory, tactile, and other sensory information into machine-understandable data about the physical world through sensors combined with large models.
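To make the "sensors plus models turn raw signals into machine-understandable data" idea concrete, here is a toy Python sketch. Every name, field, and rule in it is a hypothetical illustration invented for this article; vivo's actual on-device pipeline is not public and is certainly far more sophisticated.

```python
from dataclasses import dataclass

# Hypothetical illustration only: a minimal "perception step" that turns
# raw camera-pipeline signals into a structured record a downstream
# on-device agent could consume. None of these names come from vivo.

@dataclass
class PerceptualRecord:
    scene: str            # coarse scene label, e.g. "portrait" or "landscape"
    lighting: str         # "low", "normal", or "bright"
    focal_length_mm: int  # lens focal length used for the shot

def perceive(brightness: float, focal_length_mm: int, faces: int) -> PerceptualRecord:
    """Convert raw signals (normalized brightness, focal length, detected
    face count) into one structured perceptual record."""
    if brightness < 0.2:
        lighting = "low"
    elif brightness > 0.8:
        lighting = "bright"
    else:
        lighting = "normal"
    scene = "portrait" if faces > 0 else "landscape"
    return PerceptualRecord(scene=scene, lighting=lighting,
                            focal_length_mm=focal_length_mm)
```

In this sketch, a dim frame with one detected face would be recorded as a low-light portrait. Accumulating many such structured records over time, rather than raw pixels, is the kind of scenario data that the article argues forms a moat.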
Currently, vivo is already deploying custom AI chips and 3B-parameter edge models. The next step is ensuring stable output after large-scale commercialization to bring these ideas to fruition.
Hu Baishan predicts that smartphones will evolve from mere Smartphones to Agent Phones, ceasing to be just tools and becoming companions instead.
Here, I must also point out that the success of this vision hinges on a critical question: can the data flywheel on the device side truly gain momentum? If Agent Phone experiences fail to impress, users won't engage, and data won't accumulate—a classic chicken-and-egg challenge.
Will Robotics Land in a Decade?

Beyond Agent Phones, vivo is also exploring the technological boundaries of robotics.
This stems from Hu Baishan's understanding of future technological structures: AI and robotics represent the core technological directions of the digital and physical worlds, respectively. With the broadest user base and data entry point, smartphones may become the hub connecting the two. During a group interview at the Boao Forum, he put it bluntly: smartphones connect the digital world, while robots connect the physical world—the two may ultimately form a unified technological system.
vivo is already laying the groundwork for this goal. In 2025, vivo established a Robotics Lab, focusing on developing the robot's 'brain' and 'eyes,' with household scenarios as the long-term direction.
Hu Baishan remains cautious, concentrating resources on the most critical technological points within user scenarios.
Shao Hao, Chief Scientist at vivo's Robotics Lab, defines the target user scenario specifically: a complete closed loop that runs from the moment users enter their home and take off their outerwear, through washing and drying, to putting clothes away.
Of course, vivo isn't just talking big—they don't aim for full autonomy (L4 level) right away. Instead, they offer a rough timeline: initially, 95% of operations may rely on human-machine collaboration, gradually reducing human involvement to 60%, 30%, and finally 0% after a decade.
Hu Baishan calls this strategy 'laying eggs along the way.' From Node AI's perspective, this gradual approach is quite pragmatic. The technical maturity of the robotics sector is far from reaching the tipping point for consumer-grade adoption. Pursuing full autonomy too early would be prohibitively expensive.

vivo hopes to start with human-machine collaboration, using real-world scenario data to iterate models. Notice how similar this logic is to the data-as-competitive-barrier approach in smartphones: first establish data flow, then focus on specific implementations. With a clear direction and mature technology, everything else will naturally fall into place.
This logic also faces challenges.
Xiaomi has deployed earlier and more extensively in robotics, investing in a batch of industrial chain (supply chain) companies. Huawei, leveraging its HarmonyOS ecosystem, is also well-positioned to enter robot operating systems. vivo chooses to focus only on the 'brain and eyes,' leaving hardware to supply chain partners—a lighter asset model but with weaker supply chain control.
Whether Hu Baishan's vision succeeds depends on whether vivo's core smartphone business can continue providing financial support, whether its AI capabilities remain leading, and whether the commercialization pace of robotics matches expectations. None of the three can falter.
At Boao, Hu Baishan said, 'Press the accelerator when cognition is clear; proceed slowly when it isn't.'
This is a pragmatic stance. Over a five-to-ten-year race, perhaps the winner isn't the one who runs fastest but the one who lasts longest.
In Node AI's view, vivo has already outlined its blueprint for the next decade—now comes step-by-step implementation.
*Cover image generated by AI