A team of researchers has developed a novel benchmark to evaluate the historical knowledge of leading large language models (LLMs) and found significant limitations in their accuracy. The benchmark, named Hist-LLM, assesses the correctness of LLMs’ responses against the Seshat Global History Databank, a comprehensive repository of historical knowledge. The study tested three prominent LLMs: OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini, and found that all three struggle with nuanced historical inquiries.
The Seshat Global History Databank, named after the ancient Egyptian goddess of wisdom, served as the standard for evaluating the models. Among the tested LLMs, GPT-4 Turbo emerged as the best performer, yet it achieved only about 46% accuracy. This rate is not much higher than random guessing, highlighting significant gaps in the models’ historical understanding.
Researchers provided examples of historical questions that challenged the models. One question asked GPT-4 whether ancient Egypt had a professional standing army during a specific period. The model incorrectly affirmed this, even though such a force did not appear in Egypt until roughly 1,500 years later. The study also revealed trends suggesting biases in the training data: OpenAI and Llama models performed notably worse on questions about regions such as sub-Saharan Africa.
Maria del Rio-Chanona, a co-author of the study, emphasized that while LLMs are adept at handling basic facts, they fall short in more complex historical analyses.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task.” – Maria del Rio-Chanona
Lead researcher Peter Turchin echoed these sentiments, stating that LLMs cannot yet replace human expertise in certain domains. The study found that LLMs tend to extrapolate from widely documented historical data, which makes it difficult for them to retrieve less prominent information.
“If you get told A and B 100 times, and C 1 time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that.” – Maria del Rio-Chanona
The paper further noted that while there is ample public information about ancient empires such as Persia having standing armies, comparable details about less-documented regions may not be as accessible to LLMs.
What The Author Thinks
This benchmark study highlights a critical challenge for LLMs: handling historical knowledge with the required accuracy and depth. Despite their advanced capabilities, these models still fall short when faced with complex historical analysis, underscoring the need for further advances in AI technology. The study not only exposes technological limitations but also carries broader implications for the use of AI in educational and research settings, where accuracy and depth are paramount.