Apple Research Paper Argues That LLMs Don’t Mathematically Reason

It’s another angle on a question that matters if companies are to use this new form of artificial intelligence in the ways its promoters say will be possible.

Generative artificial intelligence in the form of text-based large language models has been touted as something that could develop into artificial general intelligence (AGI) — if it isn’t there already, say the promoters.

AGI is supposed to be software that can approach or even surpass human cognition in areas like reasoning, problem-solving, perception, learning, and language comprehension, according to McKinsey. Potential buyers and users of such technology need to know whether the promises are realistic today, years away, or simply fantasy.

More studies are coming out, some of which cast doubt on how effective LLMs are at reasoning. A recent study by researchers at the University of Virginia and the University of Washington suggested that some of the more important promised applications in CRE, those involving time series forecasting, might not be a good fit for the technology.

Now, a study from a half-dozen researchers at Apple takes a similarly grim view of current LLMs and what they might be capable of, specifically how well they can undertake mathematical reasoning, a foundational building block of any sort of financial analysis.

The researchers adapted an existing benchmark of grade-school math word problems so that it could generate many variants of each question. They found that the LLMs “exhibit noticeable variance when responding to different instantiations of the same question.” Put another way, when the researchers presented the same question worded differently, they received different answers.
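To make the idea of “different instantiations of the same question” concrete, here is a minimal sketch in Python of how a templated word problem can be re-instantiated with different names and numbers while the underlying arithmetic stays the same. The template, names, and values are illustrative only and are not taken from the Apple paper.

```python
# Illustrative sketch: instantiate one templated math word problem many ways.
# The wording changes; the reasoning needed to solve it does not.
import random

TEMPLATE = (
    "{name} picked {fri} kiwis on Friday, {sat} kiwis on Saturday, "
    "and twice as many as on Friday on Sunday. How many kiwis in total?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Return one wording of the question plus its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Oliver", "Mia", "Ravi"])   # hypothetical names
    fri, sat = rng.randint(20, 60), rng.randint(20, 60)
    question = TEMPLATE.format(name=name, fri=fri, sat=sat)
    answer = fri + sat + 2 * fri                   # the logic never changes
    return question, answer

# A model that truly reasons should score the same on every instantiation.
for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)
```

The point of such a setup is that accuracy should not move when only the surface form moves; the variance the researchers reported suggests the models are sensitive to wording rather than to the underlying logic.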

One example says that someone picked 44 kiwis on Friday, another 58 kiwis on Saturday, and then, on Sunday, twice as many as on Friday. What was the total? The answer is simple: 190, and the LLMs were able to generate it.

But then the question changed slightly, adding that five of the kiwis picked on Sunday were slightly smaller than average. The total picked should still have been 190, but the LLM said that because those kiwis were smaller, it was necessary to subtract the five from the Sunday count, reducing the total to 185.
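For readers who want to check the arithmetic, a short sketch of both the correct calculation and the mistaken one follows; the variable names are ours, not the paper’s.

```python
# Correct calculation, using the figures from the example above.
friday = 44
saturday = 58
sunday = 2 * friday              # twice what was picked on Friday
total = friday + saturday + sunday
print(total)                     # 190

# The "five smaller kiwis" detail changes nothing about the count.
# Subtracting them, as the LLM did, is the error the researchers describe.
wrong_total = total - 5
print(wrong_total)               # 185, the model's mistaken answer
```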

As the researchers noted, such variations on questions added information that was irrelevant to the reasoning and conclusion. “However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes,” they wrote.

They hypothesized that the reason for this behavior is that current LLMs don’t perform genuine logical reasoning. Rather, they “attempt to replicate the reasoning steps observed in their training data.”