AI models try to hack opponents when they realise they’re losing: Study

A new study by Palisade Research has found that some artificial intelligence (AI) models like OpenAI’s o1-preview and GPT-4o, Anthropic’s Claude Sonnet 3.5 and DeepSeek R1 resort to hacking their opposing bots when they know they’re about to lose a game.

The study shared exclusively with TIME, evaluated seven state-of-the-art AI models for their propensity to hack. It noted that slightly older models like GPT-4o and Claude Sonnet 3.5 needed to be prompted to attempt such tricks. However, newer models like o1-preview and DeepSeek R1 adopted the hack by themselves.

This shows that AI models may develop manipulative and deceptive strategies without explicit instructions. Researchers say that this ability of the models to exploit loopholes may be because models like o1-preview and R1 have been trained to not merely mimic human language, as has been the case with AI models thus far. They are, instead, trained to reason through problems using trial and error.

This has helped the models outperform their predecessors in mathematics and coding-related tests. Palisade’s Executive Director Jeffrey Ladish told TIME that this has led to a concerning trend where AI systems find shortcuts to solve problems in ways their creators never anticipated.

In some cases, this may allow the models to outmanoeuvre human efforts to control their actions. Speaking on this possibility, Ladish said, “This [behaviour] is cute now, but [it] becomes much less cute once you have systems that are as smart as us, or smarter, in strategically relevant domains.”

To experiment, researchers pitted the AI models against Stockfish, one of the world’s strongest chess engines which is a much better player than any human. They also made the models inform them of their reasoning before making their move in the game.

In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’ – not necessarily to win fairly in a chess game,” it added.

It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

OpenAI’s o1-preview tried to cheat 37% of the time and DeepSeek’s R1 tried to cheat 11% of the time. Only o1-preview was able to hack the game, succeeding in 6% of the trials.

  • Related Posts

    Bill Gates reveals the coolest code of his life that started Microsoft’s 50-year journey in computing

    Microsoft, founded in April 1975, is celebrating 50 years this Friday. The company’s journey began with a pivotal computer code written by Bill Gates, then a Harvard freshman, and his…

    Tata Electronics appoints KC Ang as president and head of Tata Semiconductor Manufacturing

    Tata Electronics on Wednesday said it has appointed KC Ang as President and Head of its Foundry business – Tata Semiconductor Manufacturing. The company said Yang will reporting to Randhir…