AI models fail to outperform mathematicians in a complex research test

 

As part of the "First Proof" project, artificial intelligence underwent one of its most difficult math tests to date, as four AI systems were asked to solve ten complex research problems.

These problems were not part of the training data for the participating models, and the answers were reviewed and evaluated by specialized mathematicians. This test is the first of its kind, combining highly complex problems and novel questions unfamiliar to artificial intelligence systems with formal evaluation by expert specialists.

The results showed that current AI models still perform worse than top mathematicians on problems of this kind: they lack mathematical intuition and remain prone to errors, including what is known as "hallucination".

Ten researchers contributed these problems from their own unpublished work. Participation was limited to publicly available models, including OpenAI's ChatGPT 5.5 Pro, alongside academic teams from the University of California, Princeton University, and the Swiss Federal Institute of Technology in Zurich.

Teams from the University of California and the Swiss Federal Institute of Technology developed what are known as "middle systems," in which one chatbot proposes solutions while another reviews and verifies them, with the two exchanging information several times as needed.
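The propose-and-review loop described above can be illustrated with a minimal sketch. It is not the teams' actual implementation, which the article does not detail. The function names, the signatures, and the toy "chatbots" are all assumptions made for illustration, and plain Python functions stand in for real model calls.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures (not from the article):
# a proposer takes the problem and the reviewer's latest feedback and returns
# a candidate solution; a reviewer takes the problem and the candidate and
# returns (accepted, feedback).
Proposer = Callable[[str, Optional[str]], str]
Reviewer = Callable[[str, str], Tuple[bool, str]]


def propose_and_review(
    problem: str,
    proposer: Proposer,
    reviewer: Reviewer,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Alternate between proposer and reviewer until the reviewer accepts
    a candidate or max_rounds is reached.

    Returns (accepted_solution_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = proposer(problem, feedback)
        accepted, feedback = reviewer(problem, candidate)
        if accepted:
            return candidate, round_number
    return None, max_rounds


# Toy stand-ins for the two chatbots (assumptions, for demonstration only).
def toy_proposer(problem: str, feedback: Optional[str]) -> str:
    # The first guess is deliberately wrong; it is corrected once feedback arrives.
    return "4" if feedback else "5"


def toy_reviewer(problem: str, candidate: str) -> Tuple[bool, str]:
    if candidate == "4":
        return True, "ok"
    return False, "2 + 2 is not " + candidate


solution, rounds = propose_and_review("What is 2 + 2?", toy_proposer, toy_reviewer)
print(solution, rounds)
```

In this toy run the reviewer rejects the first answer and accepts the second. Capping the number of rounds reflects the article's point that the exchange happens "several times as needed" rather than indefinitely. A real system would also have to cope with a reviewer that is itself wrong.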

The Swiss Federal Institute of Technology (ETH Zurich) model performed best, successfully solving six out of ten problems. This system relied on improving ChatGPT responses through an "advisory board" composed of three advanced chatbots. The University of California team came in second with a ChatGPT-based assistant system, followed by the OpenAI team using ChatGPT without assistants, and then the Princeton University team using a Gemini 3.1 Pro-based system.

Even so, three of the ten problems went unsolved by every team. According to the participants, in some cases the systems lacked the core idea that humans grasp intuitively, while in others they chose the right approach but failed to execute the details accurately.

One of the most notable challenges observed was the phenomenon of "hallucination," where AI systems produced incorrect results even when asked to verify references. It was also noted that some models copied parts of published articles and sources without clear attribution.

The researchers pointed out that publishing these problems will allow other companies and institutions to use them in the future to test the capabilities of artificial intelligence systems and evaluate their performance in the face of complex mathematical challenges.

