New Victorian-Era AI Model Trained on Public Domain Data
- Mr. Chatterbox trained exclusively on 2.9 billion tokens from Victorian-era British Library archives
- Small 340-million-parameter model demonstrates challenges of training without modern scraped datasets
- Developer utilizes Karpathy's nanochat and Claude Code to build local model integration
Mr. Chatterbox represents a fascinating experiment in the ethics of AI training. Developed by Trip Venturella, the model was built using over 28,000 Victorian-era texts published between 1837 and 1899. By sourcing data exclusively from the British Library’s out-of-copyright collection, the project sidesteps the contentious "scraped data" debate that currently plagues the industry.
Despite its historical charm, the model highlights the steep data requirements for modern performance. With only 340 million parameters and 2.93 billion training tokens, Mr. Chatterbox struggles with coherent reasoning. Technical reviewers describe the experience as similar to interacting with a Markov chain, a simple statistical model that picks each next word based only on the words immediately preceding it, with no deeper contextual understanding.
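To make the Markov chain comparison concrete, here is a minimal word-level Markov generator. This is an illustrative sketch, not code from the Mr. Chatterbox project; the toy corpus and function names are invented for the example.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)
    return chain

def generate(chain, start, length=10, seed=0):
    """Walk the chain: each next word depends only on the current word,
    which is exactly the 'no long-range context' failure mode described."""
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# Hypothetical Victorian-flavored toy corpus for demonstration only
corpus = "the fog rolled over the river and the fog crept into the city"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

The output is locally plausible word-to-word but drifts with no overall coherence, which is the behavior reviewers reportedly observed in the undertrained model.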
The project underscores the Chinchilla scaling laws, which suggest that a model of this size needs significantly more training data to reach a functional baseline. Nevertheless, the workflow used to deploy the model locally is notable. Using the nanochat framework created by Andrej Karpathy, the developer relied on Claude Code to automate the creation of a plugin that runs the model on personal hardware, showing how generative tools can bridge the gap between niche research and local execution.
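The Chinchilla gap can be sketched with simple arithmetic. The ~20 tokens-per-parameter ratio below is the commonly cited Chinchilla rule of thumb, not a figure from the article; the parameter and token counts are the ones reported for Mr. Chatterbox.

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Chinchilla heuristic: roughly 20 training tokens per parameter
    for a compute-optimal model."""
    return params * tokens_per_param

params = 340e6          # Mr. Chatterbox's parameter count
actual_tokens = 2.93e9  # size of the Victorian-era training corpus

optimal = chinchilla_optimal_tokens(params)
print(f"Compute-optimal tokens: {optimal / 1e9:.1f}B")        # 6.8B
print(f"Actual tokens:          {actual_tokens / 1e9:.2f}B")  # 2.93B
print(f"Shortfall factor:       {optimal / actual_tokens:.1f}x")
```

By this back-of-the-envelope estimate, the 2.93-billion-token corpus covers well under half of the roughly 6.8 billion tokens the heuristic prescribes, which is consistent with the incoherent output the reviewers describe.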