Can WeChat AI Sidestep the Quandary Faced by Doubao Phone?


06/30 2026

Beyond the debates over technological paths, the consumer-end (C-end) AI ecosystem confronts an even more urgent challenge: the allocation of commercial interests.

Last week, WeChat AI initiated a small-scale internal trial, offering outsiders a peek into some potential application scenarios of this widely-used app in the AI domain. The AI assistant, dubbed Xiaowei, can summarize WeChat Moments and official account articles, as well as directly send messages and red envelopes to contacts.

Beyond these basic functions, what has drawn more industry attention is WeChat's effort to link AI with external services, giving the assistant execution and task-management capabilities. For instance, a single sentence to Xiaowei is enough to book a taxi or order food. When a user asks for an iced Americano in the AI conversation interface (which defaults to voice input), Xiaowei automatically opens the Luckin Coffee or Starbucks WeChat mini-program based on the user's preferences and selects the product, but the user must still confirm the order and complete payment manually within the mini-program.

Leading internet firms including ByteDance, Alibaba, Tencent, and Ant Group are all pushing AI to evolve from simple chatbots into agents capable of execution and task management. Behind this integration of AI services, however, lies a dual test. The first is the appeal of the application ecosystem. The second is the delicate balance among developers, users, and AI entry points once user-initiated access shifts to AI-initiated invocation, a balance that is crucial to whether agent services can form a closed loop.

01 Why WeChat and Doubao Met Different Fates

Late last year, Doubao Phone launched and swiftly drew attention for its bold experiments with AI capabilities. The Doubao Phone assistant, developed in collaboration with ZTE, secured extensive operating system-level permissions, including the crucial INJECT_EVENTS permission, that enabled it to read screen information and simulate user clicks through a GUI Agent. Despite limited promotion, its initial impact reverberated through the industry, with many hailing it as a milestone for agents.

However, Doubao Phone's agent experiments soon met resistance from the apps of numerous major companies, including WeChat, which refused access from Doubao Phone. In fact, the AI assistants built by phone manufacturers hold similar system-level permissions, but those with large user bases have not used them as aggressively as Doubao Phone.

WeChat AI, in contrast, has opted for a more ecosystem-friendly approach that requires consent from both users and developers before a service is connected to AI. Half a month before the internal release of Xiaowei, WeChat issued the "Guidelines for Developers to Integrate into the WeChat AI Ecosystem," and 13 companies including JD.com, Meituan, Ctrip, KFC, and Dewu became the first batch of internal test teams for the WeChat AI ecosystem.

WeChat AI's development documentation outlines two integration methods for WeChat Xiaowei. The automatic mode requires no additional code submission; the developer simply switches on the authorization button in the management backend. It suits lightweight tools and mini-programs with simple functions. The development mode requires an application and lets developers add interface declarations and modifications tailored to their business. It suits transactional, medical, governmental, and other mini-programs with high compliance requirements and complex business logic.

However, a developer told Shuzhi Qianxian that WeChat Xiaowei's approach does not rely on the A2A (Agent2Agent) protocol but instead leverages WeChat's own mini-program ecosystem and developer interfaces to achieve service invocation and task execution.

"It's essentially transforming mini-programs into MCP-interfaced ones, opening MCP interfaces. It's not the A2A logic," the developer told Shuzhi Qianxian. MCP (Model Context Protocol) packages APIs into AI-readable interfaces that can transmit data and encapsulate tools.

Application providers can decide which capabilities to expose to MCP and which to keep private. Moreover, MCP servers feature a permission control system adhering to the principle of least privilege, ensuring that large models only invoke tools within secure boundaries. This means it provides a safe and controllable operation path for users.
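The exposure and least-privilege idea described above can be illustrated with a minimal sketch. It does not use the MCP SDK or any WeChat interface; `Tool`, `ToolRegistry`, the tool names, and the scope strings are all hypothetical and exist only to show how a provider might advertise some capabilities, keep others private, and reject calls outside the granted scopes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Tool:
    name: str
    handler: Callable[..., str]
    required_scope: str   # scope the caller must hold to invoke this tool
    exposed: bool = True  # False keeps the capability private to the app


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def list_tools(self, granted_scopes: Set[str]) -> List[str]:
        """Advertise only tools that are exposed and within the granted scopes."""
        return sorted(
            t.name
            for t in self.tools.values()
            if t.exposed and t.required_scope in granted_scopes
        )

    def call(self, name: str, granted_scopes: Set[str], **kwargs) -> str:
        tool = self.tools.get(name)
        # Private or unknown tools are indistinguishable to the caller.
        if tool is None or not tool.exposed:
            raise PermissionError(f"tool {name!r} is not exposed")
        # Least privilege: the caller needs the exact scope for this tool.
        if tool.required_scope not in granted_scopes:
            raise PermissionError(f"missing scope {tool.required_scope!r}")
        return tool.handler(**kwargs)


# Hypothetical coffee-ordering app: two exposed tools, one private tool.
registry = ToolRegistry()
registry.register(Tool("search_menu", lambda query: f"results for {query}", "menu:read"))
registry.register(Tool("create_order", lambda item: f"draft order for {item}", "order:draft"))
registry.register(Tool("refund_order", lambda order_id: "refunded", "order:admin", exposed=False))

print(registry.list_tools({"menu:read"}))  # ['search_menu']
```

An assistant holding only the `menu:read` scope sees a single tool here, and an attempt to call `create_order` or the private `refund_order` raises `PermissionError`.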

Industry insiders informed Shuzhi Qianxian that there is no inherent superiority or inferiority between these two technological routes. WeChat's stronger appeal in the AI ecosystem, besides its vast user base, is closely tied to the mini-program ecosystem it began building a decade ago. WeChat has integrated millions of mini-programs covering nearly every aspect of daily life. These mini-programs, utilizing WeChat's standard interfaces, can be swiftly invoked by agents with some intelligent modifications.

Doubao, lacking an application ecosystem, initially opted for a more aggressive GUI route. In these insiders' view, many of the major application companies that rejected Doubao Phone's simulated clicks in the name of security were not necessarily intimidated by the technology itself but were concerned about user traffic being controlled by Doubao Phone. Indeed, companies like KFC, JD.com, and Baidu did not prohibit Doubao Phone's access.

However, there are reports that ByteDance's second-generation Doubao Phone, in collaboration with ZTE, is poised for release. The new Doubao Phone, besides the GUI route of "screen recognition + simulated clicks," is also promoting interconnection through interface protocols.

The Doubao App has also fortified its connections with external applications, integrating Douyin e-commerce and payment capabilities, enabling users to directly purchase products within Doubao's conversations. Additionally, Doubao has initiated gray-scale testing for one-click taxi bookings in Beijing and Hangzhou, where users can directly state their travel needs in the chatbox, and the system automatically identifies the location, number of people, and preferences, matching routes and prices before confirming the order with one click.

02 Multiple Technological Routes Coexist as the Mainstream

The GUI Agent approach remains controversial, but its advantages and disadvantages are quite apparent. Relying on image recognition plus simulated clicks, it can quickly connect to a vast array of applications without worrying about whether interface protocols are already interconnected, or even without the permission of application providers. For the many long-tail applications in particular, the GUI Agent approach is the quickest method.

The cost, however, is that this somewhat intrusive method bypasses underlying protocol integration and can easily alarm application providers. The GUI approach also has technical shortcomings. When it encounters small fonts, blurriness, dynamic loading, complex layouts, or similar-looking controls, recognition accuracy is hard to guarantee, and visual model inference costs are high. In addition, when faced with dynamic scenarios such as pop-ups, network anomalies, and page-loading delays, GUI Agents lack awareness of the underlying system and struggle to judge the current interface state accurately, which leads to failed operations or infinite loops.

Ctrip noted in a technical article that closed-source models used for GUI Agent tasks in OTA (online travel agency) scenarios show two types of defects: they do not understand how to operate Trip.com's UI components, and they have a low success rate on long multi-step tasks (such as "entering the domestic hotel list from the homepage, selecting a bookable hotel, and entering the reservation form").

Furthermore, compared to directly invoking API interfaces, GUI processing consumes more tokens. "GUI Agent is a last-resort solution when there is no way to achieve interconnection," IDC analyst Sun Zhenya told Shuzhi Qianxian. Nowadays, browser invocations basically do not rely on GUI processing; most browser operations can be completed through CRI and are very efficient.

However, this does not imply the GUI Agent approach is devoid of value. In an industry discussion on GUI Agents, participants believed that agent technology is trending towards a hybrid model combining API calls and visual capabilities. This means agents can efficiently interact with mature systems (such as ticket-booking and hotel-booking apps) through precise API interfaces and can also understand and operate generic graphical interfaces (GUIs) without APIs through visual understanding.

For example, high-frequency, standardized tasks like booking flights and listening to music can be swiftly and stably accomplished through API calls. Meanwhile, a vast number of non-standardized long-tail tasks rely on screen recognition + simulated clicks.
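The hybrid model can be sketched as a simple dispatcher: intents with a registered API handler go through the API path, and everything else falls back to the GUI path. This is only an illustrative sketch; `HybridRouter`, the intent names, and the `gui_fallback` placeholder are assumptions made for the example and do not correspond to any vendor's implementation.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def gui_fallback(task: str) -> str:
    # Placeholder for the "screen recognition + simulated clicks" path.
    return f"gui:{task}"


class HybridRouter:
    """Prefer a precise API call; fall back to the GUI path for long-tail tasks."""

    def __init__(self, api_handlers: Dict[str, Handler], gui_handler: Handler) -> None:
        self.api_handlers = api_handlers
        self.gui_handler = gui_handler

    def route(self, intent: str, task: str) -> str:
        handler = self.api_handlers.get(intent)
        if handler is not None:
            # High-frequency, standardized task with a known interface.
            return handler(task)
        # No interface registered for this intent: use the GUI agent.
        return self.gui_handler(task)


# Hypothetical API handlers for two standardized intents.
router = HybridRouter(
    api_handlers={
        "book_flight": lambda task: f"api:book_flight:{task}",
        "play_music": lambda task: f"api:play_music:{task}",
    },
    gui_handler=gui_fallback,
)

print(router.route("book_flight", "PEK to SHA"))             # api:book_flight:PEK to SHA
print(router.route("renew_library_card", "city library app"))  # gui:city library app
```

In this sketch the choice is a plain dictionary lookup, so adding API coverage for a new intent shrinks the set of tasks that reach the slower, more token-hungry GUI path.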

Jiang Yuchen, the Director of Smart Product R&D at OPPO ColorOS, believes that GUI Agents are an intermediate transitional form and will eventually evolve towards A2A.

The interconnection between agents is deemed an ideal approach for the future, as it can maximize data security, ensure user retention, and share token costs, effectively balancing the interests of all parties.

However, the China Academy of Information and Communications Technology (CAICT) also mentioned that issues in agent interaction are gradually surfacing, such as identity credibility, authorization boundaries, data security, and responsibility tracing. When agents developed by different platforms and entities enter the same interaction network, it is necessary to clarify "who is initiating the request, on whose behalf, and whether they have the corresponding permissions." If different vendors each construct closed protocol systems, it may create new ecological barriers and redundant construction, hindering the healthy development of the agent industry.
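The three questions CAICT raises (who is initiating the request, on whose behalf, and with what permissions) can be made concrete with a small sketch. The `AgentRequest` and `Delegation` types and the agent and action names are hypothetical, and the sketch assumes identities have already been verified by some other mechanism; it is not drawn from AIP, A2A, or any published protocol.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class AgentRequest:
    initiator: str     # who is initiating the request (the calling agent)
    on_behalf_of: str  # on whose behalf (the user it claims to represent)
    action: str        # what it wants to do


@dataclass(frozen=True)
class Delegation:
    user: str
    agent: str
    allowed_actions: FrozenSet[str]  # the authorization boundary set by the user


def authorize(request: AgentRequest, delegations: Iterable[Delegation]) -> bool:
    """Allow the request only if the user delegated this action to this agent."""
    return any(
        d.user == request.on_behalf_of
        and d.agent == request.initiator
        and request.action in d.allowed_actions
        for d in delegations
    )


# Hypothetical grant: user "alice" lets "assistant-a" quote and draft rides only.
grants = [Delegation("alice", "assistant-a", frozenset({"ride:quote", "ride:draft"}))]

print(authorize(AgentRequest("assistant-a", "alice", "ride:quote"), grants))  # True
print(authorize(AgentRequest("assistant-a", "alice", "ride:pay"), grants))    # False
print(authorize(AgentRequest("assistant-b", "alice", "ride:quote"), grants))  # False
```

Keeping the delegation record explicit also gives a starting point for responsibility tracing, since every permitted action maps back to a specific user grant.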

On June 26, the State Administration for Market Regulation explicitly mentioned at a press conference on agent interconnection standardization that the lack of unified interfaces and protocols among agents from different vendors has formed "agent islands," severely restricting large-scale collaborative applications. Last year, relevant institutions already introduced the AIP agent interconnection protocol at the national standard level.

In fact, both major internet companies and terminal system manufacturers such as phone makers now generally adopt multiple technological routes for AI service access. Gemini Spark, demonstrated at Google I/O, supports three schemes simultaneously: OCR simulated clicks, cooperative software API access, and A2A.

Google's AppFunctions framework, released last year, also helps third-party apps and AI models connect through a set of standard interface specifications.

For example, the Samsung Galaxy S26 introduced Google's Gemini agent through this framework, enabling the top 200 apps in the Samsung app store to support Gemini invocations. Users can instruct Gemini to find specific photos in the gallery and send them to friends via text message. Throughout this process, Gemini does not need to open the gallery and messaging apps but uses AppFunctions to fetch the corresponding entries into Gemini for execution, enhancing efficiency.

Besides Google, Apple also has a similar framework, App Intents. In Apple's vision, users can instruct Siri to operate various apps, with the underlying implementation through App Intents.

The YOYO agent platform on Honor phones also offers three access methods for different developers: agent A2A access, MCP access, and plugin access. For example, Honor AI services configure cards into universal templates, allowing developers to embed agent services into the Honor YOYO agent conversation flow by simply providing content according to the corresponding template cards without undergoing complex processes like design, development, configuration, and testing. Ant Group's AI assistant, Afu, accessed the Honor YOYO agent in this manner.

"Invoking apps through agents will definitely be a trend in the future," CAICT analyst Ma Mingyang told Shuzhi Qianxian.

03 The Competition for AI Entry Points Tests the Redistribution of Interests

As major internet companies and terminal manufacturers like phone companies actively vie for entry points in the AI era, beyond the debates over technological routes, the C-end AI ecosystem faces an even more pressing challenge: the allocation of commercial interests.

Industry insiders informed Shuzhi Qianxian that multi-agent collaboration within enterprises is already relatively common. For instance, in enterprise data analysis, data insight agents, data fusion agents, and attribution analysis agents are invoked behind the scenes, with each agent responsible for a clear task, ultimately delivering a complete result. However, in ToC applications, interconnection with third-party app agents is still rare.

Besides the immaturity of multi-agent systems themselves, the core reason is that general-purpose AI assistants inevitably run into new questions of commercial allocation when they connect to external services. Regardless of the technological route, they cannot avoid the same issue: when user-initiated access shifts to AI-initiated invocation, the user's intentions, needs, and subsequent service choices are in the hands of the AI assistant. The entire operation may not even require jumping to a third-party platform, so concerns persist that apps will be reduced to mere pipelines.

Even within WeChat's mini-program ecosystem, the essence is still users directly searching for services. Application providers accessing mini-programs gain an additional channel to reach users. However, in the AI era, AI actively comprehends needs and selects services, with mini-programs becoming passively responsive. Questions like who the users belong to, how services will be orchestrated and scheduled, user retention, and cost-sharing currently lack clear answers.

This fundamental shift in commercial logic has also dampened some developers' enthusiasm for service invocation by AI assistants.

Last year, the poetry app Xichuangzhu adapted to both Apple Intelligence and Huawei Xiaoyi, opting for the most cost-effective lightweight integration approach. This implementation was limited to page jumps and parameter passing, preventing the AI assistant from directly accessing or manipulating internal app data or performing automated operations.

"Without redirecting users to the app, there's no traffic," Xichuangzhu's founder, Qu Zhangcai, explained to Shuzhi Qianxian. This challenge is not unique to Xichuangzhu; it is a common dilemma for third-party applications today. As AI becomes a unified service scheduling hub, apps are increasingly being streamlined, which threatens traditional advertising-based revenue models. Moreover, even without A2A (agent-to-agent) interactions, exposing API interfaces to AI assistants consumes IT resources on every API request, a substantial cost for small development teams.

Furthermore, the question of who will bear the token costs generated by multi-agent collaboration remains unresolved. "The overall landscape is still very new, and I feel that neither the regulatory nor industrial frameworks are fully mature. Generally speaking, those two major companies aren't lacking in funds, so they might temporarily cover the token costs," Ma Mingyang commented.

Despite these challenges, many application providers have embraced deep integration. For instance, East Money and Guotai Haitong Securities have integrated with Huawei Xiaoyi by encapsulating multiple Skills, enabling users to directly select stocks and query market information within the Xiaoyi assistant's conversation interface, without ever leaving the chat.

In the view of industry insiders, applications that are more service-oriented and require robust offline fulfillment capabilities are more inclined to collaborate with these general-purpose AI assistants. The final service delivery ultimately relies on these providers, who can in turn attract more targeted traffic. Examples include Didi, Gaode, and KFC. Financial and health applications that rely on professional knowledge systems, such as East Money and Ant Group's Afu, are also often more willing to be invoked by AI assistants. Conversely, applications that depend on advertising revenue and bid ranking have more reservations. Direct service invocation by AI assistants makes users somewhat less likely to open the app and makes it harder for these applications to retain users within their own ecosystems.

The exploration and competition in the AI Agent field are just beginning, with technology, user experience, and commercialization all still in their nascent stages. However, one thing is clear: a thriving ecosystem must benefit developers, users, and AI entry points equally.

