Developers claim that in just a few months, AI systems will be able to achieve a perfect score on one of the world's toughest cognitive tests, known as "Humanity's Last Exam" (HLE).
This test, designed by technology companies to measure the intelligence of their systems, consists of 2,500 carefully selected questions covering roughly 100 subjects, ranging from rocket science and mythology to physiology.
Each question demands at least PhD-level understanding, and anyone scoring close to 100% would be considered a world-leading expert.
Just two years ago, OpenAI's ChatGPT scored only 3% on this test, and the systems from Google and Anthropic fared little better. At the time, these results helped allay fears of AI dominance by demonstrating a significant gap between large language models and the world's leading academics.
But this seemingly impossible test may prove to be just another milestone in the rapid advance of artificial intelligence. Just last month, Google's Gemini system scored 45.9%, up from the 18.8% it achieved on its initial attempt only months earlier.
Calvin Chang, head of research at Scale (the company responsible for the test), says: "We wanted to create an academic test at the level of human experts, one that only a handful of people on Earth could solve. But we have seen incredible progress in language models over the past few years, and developers are doing a great job of improving the reasoning abilities of these models."
Kate Olshevska, product manager at Google DeepMind, adds: "If this is our sole purpose in life, I think we'll reach it very quickly." Anthropic (maker of the Claude AI system) achieved a score of 34.2% and is rapidly improving its results.
Achieving a 100% score on this test would be a significant development, as its creators say it is "designed to be the last closed-ended academic test of its kind." This means that if artificial intelligence can solve this test, we will need to test it in the future with questions that no human knows the answer to.
The test was created in collaboration with the Center for AI Safety, a non-profit organization, to measure the knowledge and depth of reasoning of AI systems.
In September 2024, the test's organizers launched a global call for questions, offering a $500,000 prize. Experts from approximately 50 countries responded, submitting 70,000 questions, with the requirement that the answers be short, clear, and not readily available online.
Questions that existing AI models could already answer were eliminated, narrowing the list to 13,000. A final selection of 2,500 questions was then made, with some adjustments later based on user feedback. Many of these questions remain confidential, to prevent systems from exploiting answers circulating online.
Success on this test would be reminiscent of IBM's supercomputer Deep Blue defeating world chess champion Garry Kasparov in 1997, a feat that surprised most experts. Since then, several major AI benchmarks have been surpassed, such as the 2020 Massive Multitask Language Understanding (MMLU) test, which was effectively retired after systems began scoring over 90%, making it too easy.
Olshevska adds that as AI approaches mastery of human-level testing, developers' main focus has shifted to pushing the boundaries of current human knowledge. However, Chang argues that there will always be room for human expertise, as AI struggles to master practical fields like surgery, as well as decision-making skills such as sound judgment and creativity.
