LLMs Struggle with Long-Form Narrative Consistency
- New ConStory-Bench evaluates narrative consistency across 2,000 prompts and 19 fine-grained error subtypes.
- ConStory-Checker automated pipeline detects contradictions by grounding judgments in explicit textual evidence and quotations.
- Research reveals consistency errors peak in story midsections and correlate with high token-level entropy.
Large Language Models have mastered the art of generating coherent sentences, but they often "lose the plot" when tasked with writing long-form narratives. As stories stretch into tens of thousands of words, models frequently stumble over their own world-building, contradicting established character traits or reversing temporal logic. To quantify these consistency failures, researchers introduced ConStory-Bench, a specialized framework designed to audit narrative integrity rather than just fluency or plot quality.
The benchmark categorizes inconsistencies into a taxonomy of five major error types. By analyzing 2,000 diverse prompts, the study found that LLMs are most prone to factual and temporal lapses—essentially forgetting what happened and when. Interestingly, these bugs aren't distributed evenly; they tend to cluster in the middle of a story. This suggests that as the context window (the amount of text the AI can remember at once) fills up, the model's grasp on the narrative's foundation begins to slip.
A key innovation is ConStory-Checker, an automated pipeline that doesn't just flag errors but provides evidence by citing exact quotations from the text. This grounding makes the evaluation reproducible and auditable. The researchers also discovered a link between consistency bugs and token-level entropy, which is a measure of how "uncertain" the model is about its next word choice. When the AI is unsure, it is significantly more likely to break its own established rules.
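Token-level entropy has a standard definition: the Shannon entropy of the model's next-token probability distribution. A minimal sketch of the calculation (the function name and example distributions here are illustrative, not from the paper):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution.

    Higher entropy means the model spreads probability mass across many
    candidate tokens, i.e. it is more "uncertain" about its next word --
    the regime the study links to a higher rate of consistency breaks.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident model concentrates mass on one token -> low entropy.
confident = [0.97, 0.01, 0.01, 0.01]
# An uncertain model spreads mass evenly -> maximal entropy for 4 tokens.
uncertain = [0.25, 0.25, 0.25, 0.25]

print(token_entropy(confident))  # ~0.24 bits
print(token_entropy(uncertain))  # exactly 2.0 bits (uniform over 4 tokens)
```

In practice these probabilities come from the softmax over the model's logits at each generation step, so entropy can be logged per token and then correlated with the positions where contradictions occur.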