Artificial Analysis, a leading AI evaluation company, has reported significant expenditures on benchmarking artificial intelligence models, particularly those branded as “reasoning” models. Testing OpenAI’s o1-mini alone cost the company $141.22, and evaluating a dozen-plus reasoning models came to about $5,200, more than double the roughly $2,400 it spent on over 80 non-reasoning models.
Now the costs associated with reasoning models are surging. AI labs such as OpenAI claim these more sophisticated models outperform their non-reasoning counterparts in fields such as physics. OpenAI’s o1 performed extraordinarily well on the benchmarks, but it generated a whopping 44 million tokens in the process, roughly eight times as many as GPT-4o produced.
Challenges of Benchmarking AI Models
Ross Taylor, CEO of AI benchmarking startup General Reasoning, filled in some of the gaps on the challenges of benchmarking. He said that benchmarking Claude 3.7 Sonnet against about 3,700 unique prompts set him back $580. He also expressed concerns about the reproducibility of test results, saying, “Nobody is going to reproduce the findings.”
As AI technology advances, model evaluation is becoming more complex and more expensive. OpenAI’s GPT-4.5 and o1-pro models are currently priced at $150 and $600 per million output tokens, respectively. By comparison, running the non-reasoning GPT-4o cost Artificial Analysis only $108.85, while evaluating Claude 3.6 Sonnet, the predecessor of Claude 3.7 Sonnet, came to $81.41.
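The figures above follow from simple per-token arithmetic: an evaluation’s bill is roughly the number of output tokens a model generates multiplied by its price per million output tokens, which is why verbose reasoning models are so much costlier to test. Below is a minimal sketch of that calculation in Python, using the prices quoted above and purely hypothetical token counts.

```python
# Minimal sketch of the cost arithmetic (token counts here are hypothetical,
# not Artificial Analysis's actual figures; input-token costs are ignored).

PRICE_PER_MILLION_OUTPUT = {
    "gpt-4.5": 150.00,  # USD per million output tokens, as quoted above
    "o1-pro": 600.00,
}

def eval_cost(model: str, output_tokens: int) -> float:
    """Approximate benchmarking bill: output tokens times the per-token price."""
    return output_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT[model]

# A verbose reasoning model emitting 5 million output tokens across a benchmark
# suite costs far more than a terser model would at the same per-token price.
print(f"${eval_cost('o1-pro', 5_000_000):,.2f}")   # $3,000.00
print(f"${eval_cost('gpt-4.5', 5_000_000):,.2f}")  # $750.00
```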
Artificial Analysis spent $2,767.05 assessing OpenAI’s o1 reasoning model, rigorously testing it on seven widely used AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500. Benchmarking Anthropic’s Claude 3.7 Sonnet on the same tests cost $1,485.35.
A Commitment to Rigorous Testing
Cameron from Artificial Analysis emphasized the company’s commitment to rigorous testing: “At Artificial Analysis, we conduct hundreds of evaluations every month and we devote a seven-figure budget every year to fund those endeavors.” That spending reflects a growing investment in benchmarking as the need for unbiased, reproducible assessments of AI capabilities continues to rise.
Denain, another industry veteran, offered useful context on how AI benchmarking costs have changed over time. Models have made enormous progress, he said, so the cost of reaching any given performance standard has fallen dramatically. He added that even though today’s benchmarks are more complex, the number of questions in each benchmark has gone down.
This stark rise in costs raises questions about accessibility and transparency in AI development. As reasoning models gain traction in the industry, companies must navigate a landscape where benchmarking expenses can significantly affect research budgets and project viability.
Benchmarking OpenAI’s o3-mini-high, for example, cost $344.59 in model inference. Figures like these illustrate the wide spread in costs among popular complex AI models, and the data indicates that reasoning models are considerably more expensive to test thoroughly than comparable non-reasoning models.
What The Author Thinks
The rising costs of benchmarking AI reasoning models represent a significant challenge for the industry. While innovation in reasoning models has led to impressive advancements in AI, the increasing price of evaluations risks limiting access to unbiased, thorough testing. As companies invest in these models, the rising costs of evaluation could create barriers for smaller businesses and startups, hindering the fair assessment of AI capabilities. Without better access to affordable benchmarking, the ability to measure and improve AI models may be concentrated among larger, well-funded players, potentially stifling competition and innovation.
Featured image credit: Markus Spiske via Unsplash