The list of informal, quirky AI benchmarks keeps growing. Recently, a test to see how different AI models handle a prompt like this has captured the attention of the AI community on X: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”
Comparing AI Models on the “Ball in Rotating Shape” Challenge
Some AI models handle this “ball in rotating shape” test better than others. One user on X pointed out that DeepSeek’s R1, a free model from a Chinese AI lab, outperformed OpenAI’s o1 pro mode, which is available as part of the $200-per-month ChatGPT Pro plan. Meanwhile, others noted that Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro got the physics wrong, letting the ball escape the shape.
On the other hand, some users reported that Google’s Gemini 2.0 Flash Thinking Experimental and OpenAI’s older GPT-4o nailed the evaluation on the first try.
What Does This Test Actually Prove?
Simulating a bouncing ball is a classic programming challenge: it requires collision detection code that keeps the ball from breaking the rules of physics. A poorly written implementation can bog down the simulation or produce obvious physics mistakes, such as the ball slipping through a wall. N8 Programs, a researcher at AI startup Nous Research, shared that it took him about two hours to program a bouncing ball within a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” he explained.
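For readers curious what the models actually have to get right, here is a minimal, headless sketch of one common approach, not the code any particular model produced: rotate the polygon’s vertices each frame, measure the ball’s signed distance to every edge, and, on penetration, reflect the ball’s velocity relative to the moving wall. All names and constants below (polygon_vertices, step, OMEGA, and so on) are illustrative assumptions, and rendering is omitted.

```python
# Minimal sketch: a ball bouncing inside a rotating regular heptagon.
# Headless (no graphics); names and constants are illustrative, not canonical.
import math

GRAVITY = (0.0, -9.8)   # constant downward acceleration
DT = 1 / 120            # simulation time step in seconds
BALL_RADIUS = 0.05
OMEGA = 0.5             # polygon angular velocity (rad/s, counterclockwise)
SIDES = 7               # heptagon
POLY_RADIUS = 1.0       # circumradius of the polygon

def polygon_vertices(angle):
    """Vertices of the regular polygon rotated by `angle` about the origin (CCW order)."""
    return [(POLY_RADIUS * math.cos(angle + 2 * math.pi * k / SIDES),
             POLY_RADIUS * math.sin(angle + 2 * math.pi * k / SIDES))
            for k in range(SIDES)]

def step(pos, vel, angle):
    """Advance the ball one time step and resolve collisions against each edge."""
    vx = vel[0] + GRAVITY[0] * DT
    vy = vel[1] + GRAVITY[1] * DT
    x, y = pos[0] + vx * DT, pos[1] + vy * DT

    verts = polygon_vertices(angle)
    for i in range(SIDES):
        ax, ay = verts[i]
        bx, by = verts[(i + 1) % SIDES]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        nx, ny = -ey / length, ex / length      # inward unit normal (CCW winding)
        # Signed distance from the ball center to the edge line (positive = inside).
        dist = (x - ax) * nx + (y - ay) * ny
        if dist < BALL_RADIUS:
            # Wall velocity near the contact point, approximated at the ball center,
            # due to the shape's rotation about the origin.
            wall_vx, wall_vy = -OMEGA * y, OMEGA * x
            # Reflect the ball's velocity *relative to the moving wall*.
            rvx, rvy = vx - wall_vx, vy - wall_vy
            vn = rvx * nx + rvy * ny
            if vn < 0:                          # only if moving into the wall
                rvx -= 2 * vn * nx
                rvy -= 2 * vn * ny
            vx, vy = rvx + wall_vx, rvy + wall_vy
            # Push the ball back so it sits tangent to the edge, never outside it.
            x += (BALL_RADIUS - dist) * nx
            y += (BALL_RADIUS - dist) * ny
    return (x, y), (vx, vy)

if __name__ == "__main__":
    pos, vel, angle = (0.0, 0.0), (0.8, 0.0), 0.0
    for _ in range(2000):
        pos, vel = step(pos, vel, angle)
        angle += OMEGA * DT
    print("final ball position:", pos)
```

Hooking this up to a rendering loop (for example with pygame) and adding a restitution factor are the obvious next steps; the part models tend to fumble is the collision response against edges that are themselves moving, which is exactly the multi-coordinate-system bookkeeping N8 Programs describes.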
While the bouncing ball in a rotating shape is an interesting challenge, it isn’t a rigorous AI benchmark. Small changes to the prompt can yield different results, so the test is far less empirical than it might appear. Some users had better luck with o1, while others found R1 lacking, which shows how subjective and unreliable these informal tests can be.
Ultimately, viral benchmarks like these highlight the challenge of creating meaningful and practical systems to measure AI models. It’s often tough to distinguish one model from another unless we create more standardized, relevant benchmarks.
There are ongoing efforts to develop more comprehensive benchmarks, such as ARC-AGI and Humanity’s Last Exam, that aim to evaluate AI models more rigorously. Until those more structured tests mature, users will likely keep watching entertaining GIFs of balls bouncing within rotating shapes.
Author’s Opinion
While these informal benchmarks may be fun, they don’t give us the insights needed to assess AI models meaningfully. The reality is that AI is a broad field, and no one-size-fits-all challenge will provide a true picture of a model’s capabilities. We need more nuanced and real-world tests to measure AI progress — benchmarks that can assess actual utility and problem-solving abilities, not just theoretical tasks that make for good viral content.
Featured image credit: Rawpixel via Freepik