PFN Unveils New Evaluation Method for Japanese LLM Naturalness
- PFN has validated a new method for accurately measuring the linguistic naturalness of Japanese LLMs.
- The technique identifies subtle awkwardness that conventional automated evaluation systems often overlook.
- Relative-evaluation prompts that compare AI outputs to human-written model answers significantly improve assessment accuracy.
Preferred Networks (PFN) has published findings on a new evaluation method designed to measure the naturalness of Japanese responses from its proprietary LLM, PLaMo. While benchmarks such as "ELYZA-tasks-100" are the current standard, existing methods struggle to penalize responses that are factually accurate but linguistically awkward or contextually unnatural. Even powerful foreign models often exhibit subtle problems with Japanese-specific nuance and logical structure, and automatically detecting these flaws is a difficult task for current foundation models acting as AI evaluators (LLM-as-a-Judge).
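To make the setup concrete, a minimal sketch of absolute LLM-as-a-Judge naturalness scoring might look like the following. The prompt wording and the call_judge() helper are hypothetical placeholders for illustration, not PFN's published pipeline.

```python
# Minimal sketch of absolute LLM-as-a-Judge naturalness scoring.
# The prompt wording and call_judge() are hypothetical stand-ins,
# not PFN's actual evaluation setup.

ABSOLUTE_PROMPT = """\
You are evaluating the linguistic naturalness of a Japanese response.
Rate ONLY how natural the Japanese is, not its factual accuracy.

Question:
{question}

Response:
{response}

Output a single integer from 1 (very unnatural) to 5 (perfectly natural).
"""

def call_judge(prompt: str) -> str:
    """Placeholder for a call to whatever judge LLM is available."""
    raise NotImplementedError("wire this up to your LLM API of choice")

def score_naturalness(question: str, response: str) -> int:
    """Ask the judge for an absolute 1-5 naturalness rating."""
    prompt = ABSOLUTE_PROMPT.format(question=question, response=response)
    return int(call_judge(prompt).strip())
```

This is the kind of absolute-scale setup that, per PFN's findings, tends to saturate: most modern models cluster near the top of the scale.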
The study revealed a "ceiling effect": when instructed only to rate naturalness on a five-point scale, AI judges assigned overly generous scores. To address this, PFN introduced a "relative evaluation" approach in which the AI judge compares a model's output against a high-quality, human-written "model answer" and decides which of the two is more natural. This method successfully quantified the performance gap between models with extensive Japanese training, such as PLaMo-2.2-Prime, and models that retain unnatural, translation-like qualities. By doing so, PFN can now make the remaining "room for improvement" visible even in models that appear to achieve near-perfect scores.
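A corresponding sketch of the relative-evaluation idea, again with hypothetical prompt wording and helper names rather than PFN's actual prompts, pits the model's output against the human-written reference and aggregates a win rate:

```python
# Minimal sketch of relative (pairwise) evaluation against a human-written
# model answer. Prompt wording, randomization, and helpers are assumptions.

import random

RELATIVE_PROMPT = """\
Two Japanese responses to the same question are shown below.
Decide which one reads as more natural Japanese. Answer "A" or "B" only.

Question:
{question}

Response A:
{a}

Response B:
{b}
"""

def call_judge(prompt: str) -> str:
    """Placeholder judge call, as in the previous sketch."""
    raise NotImplementedError("wire this up to your LLM API of choice")

def judge_pair(question: str, model_output: str, reference: str) -> bool:
    """Return True if the judge prefers the model output over the reference.

    Positions are randomized so the judge cannot exploit a fixed ordering.
    """
    model_first = random.random() < 0.5
    a, b = (model_output, reference) if model_first else (reference, model_output)
    verdict = call_judge(RELATIVE_PROMPT.format(question=question, a=a, b=b)).strip()
    return verdict == ("A" if model_first else "B")

def win_rate(examples: list[tuple[str, str, str]]) -> float:
    """Fraction of (question, model_output, reference) items the model wins."""
    wins = sum(judge_pair(q, out, ref) for q, out, ref in examples)
    return wins / len(examples)
```

Pairwise comparison against a strong human reference converts near-saturated absolute scores into a win rate, which is why a measurable gap can reappear between models that both look near-perfect under absolute scoring.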
These results represent a vital step toward natural dialogue that feels intuitive to Japanese users, moving beyond simple factual accuracy. PFN plans to accelerate the development of user-friendly domestic LLMs by maintaining high-precision evaluation while controlling costs. While recent AI development has focused heavily on reasoning capabilities, this research highlights the critical importance of objective benchmarks that measure an AI's fundamental linguistic naturalness in the user's native language.