06/23 2025
With the rapid advancement of artificial intelligence and sensor technology, autonomous vehicles have increasingly captured public attention. Among the various technical routes for the perception system, the "pure vision solution" (relying primarily on cameras for environmental perception, with little or no use of lidar or millimeter-wave radar) has attracted significant interest, particularly under the influence of companies such as Tesla, and by 2024 it had even become a primary technological direction pursued by automakers. Yet while the pure vision route may seem cost-effective and human-like in its perception, its drawbacks are becoming increasingly apparent as the technology matures.
Human Vision ≠ Machine Vision: Fundamental Differences in Cognitive Abilities
The pure vision solution aims to enable autonomous vehicles to "understand" the world the way humans do. Human drivers rely primarily on their eyes and brains to read road conditions, judge distances, and anticipate risks. However, a combination of cameras and neural networks cannot fully replicate human perception. The human brain, shaped by millions of years of evolution, processes visual scenes with remarkable accuracy and reasons about them using experience and common sense: a driver instinctively slows down on seeing children playing by the roadside. Deep neural networks, by contrast, are trained on finite samples and often behave inconsistently in unfamiliar or complex situations. Moreover, a pure vision system has no equivalent of touch or innate depth perception, and cannot gauge distance and speed the way humans do through stereoscopic vision and muscle feedback.
Limited Depth Perception: Large Errors in Distance Judgments
Cameras excel at capturing two-dimensional images, but autonomous driving requires an understanding of three-dimensional space. To approximate stereoscopic vision, the pure vision solution uses binocular or multi-camera setups and calculates object distances from parallax. This method has physical limits, however, and is strongly affected by the baseline distance between cameras, imaging quality, and ambient lighting. Cameras can estimate distance accurately at close range (e.g., within 5 meters), but depth errors grow sharply beyond roughly 30 meters. In conditions where image quality degrades, such as night driving, backlight, rain, or snow, depth estimation becomes even less reliable. Misjudging the distance to the vehicle ahead at highway speeds at night can mean braking too late and rear-ending it.
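To see why the error grows so quickly, recall the standard pinhole stereo model: depth Z = f * B / d, where f is the focal length in pixels, B the camera baseline, and d the disparity. A fixed disparity error therefore translates into a depth error that grows roughly with the square of the distance. The sketch below illustrates that relationship; the focal length, baseline, and disparity noise are assumed, illustrative values, not the parameters of any particular vehicle.

    # Minimal sketch: how stereo depth error grows with distance.
    # Focal length, baseline, and disparity noise below are illustrative
    # assumptions, not the specifications of any real camera system.

    def stereo_depth_error(distance_m, focal_px=1000.0, baseline_m=0.12,
                           disparity_err_px=0.25):
        """First-order depth uncertainty of a pinhole stereo pair.

        Depth:  Z = f * B / d
        Error:  dZ ~= (Z**2 / (f * B)) * dd
        """
        return (distance_m ** 2) / (focal_px * baseline_m) * disparity_err_px

    for z in (5, 10, 30, 60, 100):
        err = stereo_depth_error(z)
        print(f"{z:>4} m -> +/-{err:6.2f} m ({100 * err / z:4.1f}% of range)")

With these assumed numbers the uncertainty is a few centimeters at 5 meters, already close to two meters at 30 meters, and tens of meters at 100 meters, consistent with the qualitative claim above.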
High Sensitivity to Light Changes: Poor Adaptation to Extreme Environments
The pure vision system relies heavily on natural lighting and tolerates lighting changes far less well than radar-based sensors. Cameras can capture detailed road conditions in bright daylight, but their performance drops significantly at night, in tunnels, under strong backlight, or in rain and snow. At night, for instance, the headlights of oncoming vehicles cause strong glare, leading to overexposure and blurred object edges and sometimes leaving targets unrecognizable in the image. In heavy fog or rain, imaging quality likewise degrades sharply, with blurred frames and increased noise, causing the system to misjudge or miss objects altogether.
In contrast, millimeter-wave radars and lidars, which do not rely on visible light but actively scan the environment using electromagnetic waves or lasers, exhibit strong all-weather capabilities, especially in low-visibility conditions. The pure vision solution's vulnerability in these environments poses a serious obstacle to its practical deployment.
Difficulties in Solving Occlusion and Blind Spot Issues
An autonomous driving system must handle complex road scenes, including obstructed lines of sight, occlusion by other vehicles, and busy intersections. Cameras, which depend on visible-light imaging, have limited fields of view and are prone to blind spots. For example, when a pedestrian is about to dash out from behind the vehicle ahead, a pure vision system that relies mainly on front camera images cannot "see" the pedestrian while they are fully occluded by that vehicle. Recognition may only happen once the leading vehicle brakes or changes lanes, leaving the system very little time to react.
Strong Dependence on Training Data, Weak Generalization Ability
The deep learning models behind a visual perception system depend on vast amounts of clearly labeled, diverse training data covering different traffic scenarios, weather conditions, road types, and so on, so that the system learns to recognize different objects and behaviors. Real-world roads, however, vary enormously, and situations the model has never "seen" keep arising. If torrential rain floods the roads and washes out lane markings, or an unusual traffic sign appears that the system never learned, the pure vision model often "fails to understand" and may even make wrong decisions, simply because no similar case exists in its training data. Deep learning models are also largely "black boxes": it is hard to explain why they reach a given judgment, which complicates troubleshooting and system optimization. Multi-sensor fusion systems (e.g., lidar + vision + radar), by contrast, cross-check data from multiple sources, so another sensor can compensate when one fails, giving the system stronger robustness and generalization.
Lack of Speed Perception Ability, Difficult to Handle High-Speed Scenarios
An autonomous driving system must not only identify what an object is and where it is, but also predict where it is going and how fast. In high-speed scenarios this time sensitivity is critical: on a highway, the system needs to anticipate a lane change by the vehicle ahead from its speed and acceleration and respond promptly. Cameras capture image frames continuously, but they do not measure the relative speed of objects directly. A pure vision solution typically infers speed from optical flow or target tracking, yet this indirect estimate is noisy, especially with motion blur or uneven sampling intervals, and can prevent the system from reliably predicting the motion of the vehicle ahead. Millimeter-wave radar, in contrast, measures the radial velocity of targets directly, greatly improving the system's responsiveness to dynamic changes around it. In high-speed scenarios this capability often determines whether the system is safe at all; lacking it, the pure vision solution struggles to handle high-speed autonomous driving on its own.
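A rough back-of-the-envelope sketch shows why differencing noisy per-frame distance estimates is a poor substitute for a direct velocity measurement. The frame rate, range noise, and target speed below are illustrative assumptions, not benchmarks of any real sensor.

    import random

    # Illustrative assumptions: a 30 fps camera whose per-frame range estimate
    # at ~60 m carries +/-1.5 m of noise, a target closing at a steady 10 m/s,
    # and a radar modeled as a direct, low-noise radial-velocity reading.
    random.seed(0)
    FPS, RANGE_NOISE_M, TRUE_SPEED = 30.0, 1.5, -10.0
    dt = 1.0 / FPS

    def differenced_speed(z_now, z_prev):
        """Finite-difference speed from two noisy range estimates."""
        return (z_now - z_prev) / dt

    true_range = 60.0
    prev_meas = true_range + random.gauss(0, RANGE_NOISE_M)
    true_range += TRUE_SPEED * dt
    curr_meas = true_range + random.gauss(0, RANGE_NOISE_M)

    print("camera, frame differencing:",
          round(differenced_speed(curr_meas, prev_meas), 1), "m/s")
    print("radar, direct radial velocity:",
          round(TRUE_SPEED + random.gauss(0, 0.1), 1), "m/s")

With 1.5 meters of range noise and a 33 ms frame gap, a single frame pair can put the differenced estimate off by tens of meters per second, which is why vision systems must filter over many frames, while the modeled radial-velocity reading is already within a fraction of a meter per second.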
Insufficient Safety Redundancy, Difficult to Meet Autonomous Driving Requirements Above L4
As autonomous driving advances to L3/L4, the requirements on system stability, fault tolerance, and safety redundancy become extremely strict, because any sensor failure can have severe consequences. Automakers therefore typically adopt a "sensor redundancy" strategy, in which several different types of sensors corroborate one another to keep the perception results reliable. For cost reasons, the pure vision solution usually includes neither redundant cameras nor other types of perception. If the main camera is contaminated, damaged, or occluded, or its software crashes, the system's perception is severely degraded or lost entirely. This single mode of perception cannot meet the reliability requirements of high-level autonomous driving, especially in unmanned scenarios such as Robotaxi services and autonomous delivery vehicles, where timely human intervention cannot be counted on: once perception fails, the vehicle can only stop or malfunction, creating significant safety hazards. A multi-sensor fusion solution, with its greater fault tolerance, lets other sensors "take over" and avoids a complete breakdown of perception.
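As a simplified illustration of what "taking over" can mean in software, the sketch below reports the nearest obstacle seen by any still-healthy sensor, so a blocked camera does not blind the whole pipeline. The sensor names, health flags, and single-number output are hypothetical simplifications, not a real fusion architecture.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical, simplified sensor-redundancy arbiter: perception keeps
    # working with whichever healthy sensors remain instead of failing outright.

    @dataclass
    class SensorReading:
        name: str
        healthy: bool
        obstacle_range_m: Optional[float]  # None if nothing detected

    def fused_obstacle_range(readings):
        """Closest obstacle reported by any healthy sensor, or None if all failed."""
        usable = [r.obstacle_range_m for r in readings
                  if r.healthy and r.obstacle_range_m is not None]
        return min(usable) if usable else None

    readings = [
        SensorReading("front_camera", healthy=False, obstacle_range_m=None),  # lens blocked
        SensorReading("mm_wave_radar", healthy=True, obstacle_range_m=42.0),
        SensorReading("lidar", healthy=True, obstacle_range_m=41.5),
    ]
    print(fused_obstacle_range(readings))  # 41.5: radar and lidar cover for the camera

A vision-only stack has only the camera entry in such a list, so a single contaminated lens leaves it with nothing to fall back on.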
Vision is the Foundation, Fusion is the Future
Vision undeniably plays a central role in autonomous driving perception. Cameras are low-cost, compact, and rich in information, serving as the "eyes" of the vehicle. But eyes alone cannot complete every driving task, especially in L4/L5 systems with high safety and redundancy requirements, where "ears" (radar), "fingers" (touch), and "brains" (maps and high-precision positioning) must also cooperate. From the perspective of technological evolution, most autonomous driving systems capable of large-scale deployment take the multi-sensor fusion route: vision, millimeter-wave radar, lidar, ultrasound, IMUs, and so on together form a complex sensing system in which each component plays its part and complements the others across different scenarios, yielding safer and more reliable perception. The pure vision solution has a cost advantage, but its technical drawbacks mean it cannot support high-level autonomous driving on its own. The core of future autonomous driving is not "getting rid of radar" but "using it sensibly." Blindly rejecting non-visual sensors not only fails to deliver real cost savings but may also increase the probability of system errors, making the approach counterproductive.
-- END --