Vision-Language-Action Model

The VLA model combines vision, language, and action, enabling the smart driving system not only to recognize and describe the road, the environment, traffic signs, and road participants, but also to understand complex scenarios such as negotiation and hidden semantic information, with advanced logical reasoning capabilities.
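As a rough illustration of the idea, the sketch below models a VLA-style policy as one interface that fuses a vision signal and a language description of the scene into a driving action. Every name here (`Observation`, `Action`, `ToyVLAPolicy`) is hypothetical and purely illustrative; a real VLA model is a learned neural network, not hand-written rules.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image_features: List[float]   # stand-in for a pooled camera embedding
    scene_text: str               # e.g. "pedestrian waiting at crosswalk"

@dataclass
class Action:
    steering: float   # radians; negative steers left
    accel: float      # m/s^2; negative means braking

class ToyVLAPolicy:
    """Toy stand-in for a learned vision-language-action model."""

    def act(self, obs: Observation) -> Action:
        # Language branch: hidden semantics such as a yielding pedestrian
        # trigger caution (a real model would reason over tokens instead
        # of matching keywords).
        cautious = "pedestrian" in obs.scene_text or "yield" in obs.scene_text
        # Vision branch: a crude proxy for free space ahead.
        free_space = sum(obs.image_features) / max(len(obs.image_features), 1)
        accel = -2.0 if cautious else min(free_space, 1.5)
        return Action(steering=0.0, accel=accel)

policy = ToyVLAPolicy()
action = policy.act(Observation([0.4, 0.6], "pedestrian waiting at crosswalk"))
```

The point of the sketch is the interface: both modalities flow into a single decision, which is what lets the system act on "hidden" semantic cues that a pure perception stack would miss.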

High performance
Handles complex scenarios efficiently to improve safety.
Highly cognitive
Strong capability to comprehend, reason, and act accordingly.
Human-like
Imitates human driving behaviors to improve the driving experience.
Explainability
A transparent decision-making process builds trust.

End-to-end Model

In an end-to-end model, perception, prediction, planning, and other modules are combined into one neural network. Trained on numerous video clips, the smart driving system is capable of learning, thinking, and analyzing on its own to handle complex driving tasks.
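The contrast can be sketched as follows: a modular stack chains hand-engineered stages through explicit interfaces, while an end-to-end model maps sensor input straight to a plan. All functions and data shapes below are hypothetical toys chosen to make the structural difference visible, not a description of any production stack.

```python
# Modular stack: each stage is a separate, hand-engineered function.
def detect(frame):
    # Perception: keep only visible objects.
    return [obj for obj in frame if obj.get("visible")]

def predict(objects):
    # Prediction: extrapolate each object one step ahead.
    return [{**o, "future_x": o["x"] + o["vx"]} for o in objects]

def plan(predictions):
    # Planning: brake if anything ends up near our lane center.
    return "brake" if any(abs(p["future_x"]) < 1.0 for p in predictions) else "cruise"

def modular_pipeline(frame):
    return plan(predict(detect(frame)))

def end_to_end_model(frame):
    # Stand-in for one jointly trained network: a single mapping from
    # raw input to a plan, with no hand-designed intermediate interfaces.
    return ("brake"
            if any(o.get("visible") and abs(o["x"] + o["vx"]) < 1.0 for o in frame)
            else "cruise")

frame = [{"visible": True, "x": 2.0, "vx": -1.5}]
```

In the modular version, errors and information loss can accumulate at each hand-designed interface; the end-to-end version is learned as one piece, which is the motivation given in the text.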

From Modularized to End-to-end Model

Rule-based (more engineering, adequate data) → Learning-based (less engineering, more data)

  • 2017: Detection, Object tracking, Late fusion, Prediction, Decision, Planning, Control, Mapping, Localization

  • 2022: Prediction, Mapping, Localization, Decision, Planning, Multi-sensor fusion, Control

  • 2023: General Perception Net, Prediction Planning Net, Control

  • 2025: Initial road test of the end-to-end model; VLA model deployed on consumer cars

Data loop

With the help of map providers, we have a complete data pipeline covering collection, labeling, cleansing, tagging, quality assurance, model training, test validation, and more. The data loop learns continuously, enabling the smart driving system to iterate and improve autonomously.

Numerous data → Data mining → Model training
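The loop above can be sketched as composed stages, each a function over a batch of samples, with only validated data reaching training. Stage names mirror the text; the implementations and sample fields (`clip`, `label`, `ok`) are invented for illustration.

```python
def collect():
    # Raw recordings; one clip here is flagged as corrupt.
    return [{"clip": "c1", "label": None, "ok": True},
            {"clip": "c2", "label": None, "ok": False}]

def label(samples):
    # Labeling: attach an annotation to every sample.
    return [{**s, "label": "lane_change"} for s in samples]

def cleanse(samples):
    # Cleansing: drop corrupt recordings.
    return [s for s in samples if s["ok"]]

def quality_check(samples):
    # Quality assurance: only labeled samples may train the model.
    return [s for s in samples if s["label"] is not None]

def train(samples):
    # Stand-in for model training: report how much data survived.
    return {"trained_on": len(samples)}

def data_loop():
    # One pass of the loop: collect -> label -> cleanse -> QA -> train.
    return train(quality_check(cleanse(label(collect()))))

model = data_loop()
```

Structuring the loop as composable stages is what makes it repeatable: each pass of collection-to-training can run automatically, which is the continuous iteration the text describes.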