Code2World AI Predicts App Screens via Code Generation
- Code2World predicts future visual states of mobile apps by generating renderable HTML code instead of pixels.
- Researchers at AMAP-ML released AndroidCode, a dataset of 80,000 high-fidelity screen-action pairs for training agentic AI.
- The 8B model rivals frontier-model performance in UI prediction, increasing navigation success rates by 9.5%.
Navigating mobile apps autonomously is a complex task for agentic AI, often requiring the system to anticipate what a screen will look like after a specific button is tapped. Researchers from AMAP-ML have introduced Code2World, a novel world model that predicts these future visual states by generating renderable code, such as HTML, rather than raw pixels. Because the user interface is treated as structured code, the system achieves significantly better controllability and visual clarity than traditional image-based prediction, allowing a deeper understanding of digital environments.
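The core interface of such a code-based world model can be pictured as a function from (current screen, action) to predicted next-screen HTML. The sketch below is illustrative only: `UIAction` and `build_prompt` are hypothetical names, not part of the released Code2World code, and the actual prediction would be produced by the trained VLM rather than a template.

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """A hypothetical representation of one user interaction."""
    kind: str       # e.g. "tap", "type", "scroll"
    target: str     # element identifier, e.g. a button's visible text
    text: str = ""  # payload for "type" actions

def build_prompt(current_html: str, action: UIAction) -> str:
    """Compose a prompt asking a vision-language model to emit the
    next screen state as renderable HTML (sketch, not the paper's prompt)."""
    payload = f" with text '{action.text}'" if action.text else ""
    return (
        "You are a world model for a mobile app.\n"
        f"Current screen (HTML):\n{current_html}\n"
        f"Action: {action.kind} on '{action.target}'{payload}\n"
        "Return only the HTML of the screen after this action."
    )
```

In this framing, the model's output is a complete HTML document, so the predicted state can be rendered, inspected, and diffed against a real screenshot rather than compared pixel-by-pixel as an opaque image.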
To build this capability, the team developed the AndroidCode dataset, which contains over 80,000 high-quality pairs of screen states and actions. They refined this data using a visual-feedback mechanism to ensure the generated code accurately reflects real-world app behavior. This approach overcomes the major industry hurdle of data scarcity, providing a rich corpus for training a Vision Language Model (VLM) to understand how mobile interfaces evolve during interaction.
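A visual-feedback refinement loop of this kind can be sketched as a filter that renders each candidate code sample and keeps only pairs whose rendering closely matches the real screenshot. The function below is a minimal illustration under assumed interfaces: `render` and `score` are hypothetical callables standing in for an HTML renderer and a visual-similarity metric, not the team's actual pipeline.

```python
from typing import Callable, List, Tuple

# (screenshot, action description, generated code) -- hypothetical record shape
Pair = Tuple[object, str, str]

def filter_pairs(
    pairs: List[Pair],
    render: Callable[[str], object],
    score: Callable[[object, object], float],
    threshold: float = 0.9,
) -> List[Pair]:
    """Keep only pairs whose generated code, once rendered, visually
    matches the real screenshot above a threshold (illustrative sketch)."""
    kept = []
    for screenshot, action, code in pairs:
        if score(render(code), screenshot) >= threshold:
            kept.append((screenshot, action, code))
    return kept
```

Filtering on rendered appearance rather than on the code text itself is what lets such a pipeline discard samples that are syntactically valid HTML but do not look like the real app screen.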
The technical core of the system is a two-stage training recipe: supervised fine-tuning (SFT) first teaches the model basic layout following, and reinforcement learning then rewards it based on the visual accuracy of its rendered output. The resulting 8-billion-parameter model rivals frontier models such as GPT-5 on UI-prediction tasks. Notably, it also acts as a powerful auxiliary tool, boosting the navigation success rates of smaller, more efficient models by nearly 10% on standard benchmarks such as AndroidWorld.
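A reward based on the visual accuracy of rendered output could, in its simplest form, compare the rendered prediction to the ground-truth screenshot at the pixel level. The sketch below assumes both are same-sized RGB arrays; the paper's actual reward is not specified here, so `visual_reward` is a stand-in illustration.

```python
import numpy as np

def visual_reward(pred: np.ndarray, target: np.ndarray) -> float:
    """Reward in [0, 1]: 1 minus the mean absolute pixel error between
    the rendered prediction and the ground-truth screenshot (sketch)."""
    if pred.shape != target.shape:
        return 0.0  # unrenderable or mis-sized output earns no reward
    err = np.abs(pred.astype(np.float32) - target.astype(np.float32)) / 255.0
    return float(1.0 - err.mean())
```

Because the model emits code rather than pixels, this kind of reward is cheap to compute at scale: every rollout can be rendered deterministically and scored against the recorded screen, giving RL a dense, automatically verifiable signal.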