OpenAI has released a new benchmark that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, called GDPval, is an early attempt at understanding how close OpenAI’s systems are to outperforming humans at “economically valuable work,” a key part of the company’s mission to develop artificial general intelligence (AGI). OpenAI claims that its GPT-5 model and Anthropic’s Claude Opus 4.1 “are already approaching the quality of work produced by industry experts.”
The GDPval benchmark is based on nine industries that contribute the most to America’s gross domestic product, including healthcare, finance, manufacturing, and government. The test measures an AI model’s performance in 44 occupations, ranging from software engineers to nurses to journalists. For this first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals and choose the better one. OpenAI then averages the AI model’s “win rate” against the human reports across all 44 occupations.
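The scoring described above reduces to simple arithmetic: count wins and ties per occupation, then average those rates across occupations. A minimal sketch of that calculation, using hypothetical occupation names and grader judgments (not real GDPval data):

```python
# Illustrative sketch of win-rate averaging; the occupations and
# judgments below are made up, not actual GDPval results.

# For each occupation, graders compare an AI-generated report against a
# human expert's report; a "win" or "tie" counts toward the model.
judgments = {
    "software_engineer": ["win", "loss", "tie", "loss"],
    "nurse": ["loss", "loss", "win", "loss"],
    "journalist": ["tie", "win", "loss", "loss"],
}

def occupation_win_rate(results):
    """Fraction of tasks where the AI report won or tied."""
    return sum(r in ("win", "tie") for r in results) / len(results)

# GDPval averages the per-occupation win rates across all occupations
# (44 in v0; only three are shown here for brevity).
per_occupation = {occ: occupation_win_rate(r) for occ, r in judgments.items()}
overall = sum(per_occupation.values()) / len(per_occupation)
print(f"{overall:.3f}")  # prints 0.417
```

Averaging per-occupation rates (rather than pooling all tasks) keeps occupations with many tasks from dominating the overall score.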
GPT-5 and Anthropic’s Claude Show Strong Results
OpenAI’s “souped-up” version of GPT-5, called GPT-5-high, was ranked as better than or on par with industry experts 40.6% of the time. The company also tested Anthropic’s Claude Opus 4.1 model, which was ranked as better than or on par with industry experts in 49% of tasks. OpenAI attributes Claude’s higher score partly to its tendency to produce visually pleasing graphics rather than to superior raw performance.
OpenAI acknowledges that most working professionals do a lot more than submit research reports, which is all that GDPval-v0 tests for. The company says it plans to create more robust tests in the future that can account for more industries and interactive workflows. Nonetheless, OpenAI sees the progress on GDPval as notable. For comparison, OpenAI’s GPT-4o model, released roughly 15 months earlier, scored just 13.7% (wins and ties versus humans), and OpenAI expects that pace of improvement to continue.
A Glimpse into the Future of Work
OpenAI’s chief economist, Dr. Aaron Chatterji, said the results suggest that people in these jobs can now use AI models as a tool to spend time on more meaningful tasks. He said that because “the model is getting good at some of these things,” people in those jobs “can now use the model… to offload some of their work and do potentially higher value things.”
While other benchmarks for measuring AI progress, such as AIME 2025 and GPQA Diamond, are popular, many AI researchers have cited the need for better tests that can measure a model’s proficiency on real-world tasks. Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI makes the case that its AI models are valuable for a wide range of industries.
Author’s Opinion
The GDPval benchmark represents a crucial shift in how AI progress is measured. Moving beyond abstract tests of math and science to a benchmark of “economically valuable work” is a more honest and relevant way to assess the technology’s real-world impact. While the current test is limited, it is a significant step towards creating a framework that can accurately predict the future of work and help both businesses and policymakers prepare for a world where AI is a powerful co-pilot in almost every job. This is a vital exercise that, if expanded, could give us a much clearer picture of how AI will transform our economy, and what role humans will play in it.
Featured image credit: Freepik