LLMs Test Strategic Planning in TetrisBench
- TetrisBench evaluates LLM planning and long-horizon optimization through structured game data.
- Models perform best when generating logic-based scoring functions instead of selecting moves directly.
- Expert humans maintain an edge by using irregular board patterns that challenge AI optimization rules.
Yoko Li, a partner at a16z, has introduced TetrisBench, a novel evaluation framework designed to probe the strategic depth of large language models (LLMs) through the classic game of Tetris. Unlike traditional tests that focus on chat or simple logic, this benchmark treats the game board as structured data, forcing models to navigate a continuous stream of trade-offs between immediate line clears and long-term survival.
Initial experiments showed that models struggle when asked to pick moves turn-by-turn. However, their performance improved dramatically when the problem was reframed as a coding task. By generating a scoring function—a specific set of rules to evaluate the board—the models created deterministic logic that outperformed direct human-like decision-making. This shift reveals that current AI is more adept at defining objective strategies than executing real-time spatial intuition.
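The article does not publish the scoring functions the models produced, but the idea of "a specific set of rules to evaluate the board" can be sketched with classic Tetris heuristics such as aggregate column height, holes, and surface bumpiness. The function and weights below are illustrative assumptions, not TetrisBench's actual code:

```python
# Hypothetical sketch of a deterministic board-scoring function, in the
# spirit of what the article describes. A board is a list of rows,
# top row first; 1 = filled cell, 0 = empty.

def column_heights(board):
    """Height of each column, measured from the bottom of the board."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r  # first filled cell from the top fixes the height
                break
        heights.append(h)
    return heights

def count_holes(board):
    """Empty cells that have at least one filled cell above them."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes

def score_board(board):
    """Weighted sum of classic features; higher is better.

    The weights are illustrative placeholders, not tuned values."""
    heights = column_heights(board)
    aggregate_height = sum(heights)
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    complete_lines = sum(1 for row in board if all(row))
    return (0.76 * complete_lines
            - 0.51 * aggregate_height
            - 0.36 * count_holes(board)
            - 0.18 * bumpiness)
```

An agent using such a function plays deterministically: it simulates every legal placement of the current piece, scores each resulting board, and commits to the highest-scoring one, which matches the article's point that the model defines the strategy once rather than reasoning about each move in real time.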
The data also highlighted distinct behavioral styles. Gemini 3 Pro emerged as a leader with a 62% win rate, characterized by a highly efficient, low-intervention approach. In contrast, top-tier human players still hold an advantage through "controlled chaos": they deliberately create irregular, off-distribution board states that the models' rigid optimization heuristics fail to handle.
This experiment suggests that a model’s "optimization horizon," or its ability to plan for the distant future, is a measurable behavioral trait. Understanding how and when models choose to rewrite their own strategies provides a new lens for evaluating the reliability of future autonomous agents in complex environments.