AI isn’t very good at history, new paper finds


AI can excel at certain tasks, like coding or creating a podcast. But it struggles to pass a high-level history test, a new paper finds.

A team of researchers developed a new benchmark to test three top large language models (LLMs)—OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini—on historical questions. The benchmark, Hist-LLM, checks the correctness of answers against the Seshat Global History Databank, a massive database of historical knowledge named after the ancient Egyptian goddess of wisdom.
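To make that setup concrete, here is a minimal sketch of how a benchmark of this kind might score a model: each question has a ground-truth answer drawn from a databank, the model's reply is compared against it, and overall accuracy is the fraction answered correctly. Everything below (the record layout, the `query_model` stub, and the sample items) is a hypothetical illustration, not the paper's actual harness or data.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical record: a question about a polity and time period,
    # paired with a ground-truth answer from a history databank.
    question: str
    ground_truth: str  # e.g. "present" or "absent"

def query_model(question: str) -> str:
    """Stub for an LLM call; a real harness would call a model API here."""
    return "present"  # placeholder answer

def evaluate(items: list[BenchmarkItem]) -> float:
    """Return the fraction of items the model answers correctly."""
    correct = sum(
        query_model(item.question).strip().lower() == item.ground_truth
        for item in items
    )
    return correct / len(items)

# Tiny example run with made-up items (not from the Seshat databank).
items = [
    BenchmarkItem("Was scale armor present in Old Kingdom Egypt?", "absent"),
    BenchmarkItem("Was bronze metallurgy present in Shang-era China?", "present"),
]
print(f"accuracy: {evaluate(items):.0%}")
```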

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy, not much better than random guessing.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more fine-grained, PhD-level historical research, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared with TechCrunch sample historical questions that the LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present in ancient Egypt during a specific time period. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical history questions when they can be so good at answering very complex questions about things like coding? Del Rio-Chanona told TechCrunch that this is likely because LLMs tend to extrapolate from historical data that is very prominent, and they find it difficult to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a certain historical period. The correct answer is no, but the LLM wrongly answered that it did. This is likely because there is abundant public information about other ancient empires, such as Persia, that did have standing armies.

“If you’re told A and B 100 times and C 1 time, and then asked a question about C, you might remember A and B and try to extrapolate from there,” said Del Rio-Chanona.

The researchers also identified other trends, including that the OpenAI and Llama models performed worse on questions about certain regions, such as sub-Saharan Africa, suggesting possible biases in their training data.

The results show that LLMs are not yet a substitute for humans in certain domains, said Peter Turchin, who led the study and is a faculty member at CSH.

But the researchers remain optimistic that LLMs can help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential of these models to aid historical research,” the paper reads.
