‘First AI software engineer’ is bad at its job


A service described as “the first AI software engineer” appears to be rather bad at its job, based on a recent evaluation.

The auto-coder is called “Devin” and was introduced in March 2024. The bot’s creator, an outfit called Cognition AI, has made claims such as “Devin can build and deploy apps end to end,” and “can autonomously find and fix bugs in codebases.” The tool reached general availability in December 2024, starting at $500 per month.

“Devin is an autonomous AI software engineer that can write, run and test code, helping software engineers work on personal tasks or their team projects,” Cognition’s documentation declares. It “can review PRs, support code migrations, respond to on-call issues, build web applications, and even perform personal assistant tasks like ordering your lunch on DoorDash so you can stay locked in on your codebase.”

The service uses Slack as its main interface for commands, which are sent to its computing environment, a Docker container that hosts a terminal, browser, code editor, and planner. The AI agent supports API integration with external services. This allows it, for example, to send email messages on a user’s behalf via SendGrid.
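That sort of integration is routine plumbing for a human developer. As a point of comparison only, the sketch below uses SendGrid's official Python client to send a message; the addresses and environment variable are placeholders for illustration, not anything drawn from Devin's documentation.

```python
# Minimal sketch of sending mail through SendGrid's Python client.
# Assumes the `sendgrid` package is installed and SENDGRID_API_KEY is set;
# the addresses are placeholders, not part of Devin's documentation.
import os

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

message = Mail(
    from_email="agent@example.com",      # placeholder sender
    to_emails="developer@example.com",   # placeholder recipient
    subject="Task finished",
    plain_text_content="The requested job has completed.",
)

client = SendGridAPIClient(os.environ["SENDGRID_API_KEY"])
response = client.send(message)
print(response.status_code)  # 202 means SendGrid accepted the message
```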

Devin is a “compound AI system,” meaning it relies on multiple underlying AI models, a set that has included OpenAI’s GPT-4o and can be expected to evolve over time.

In theory, you should be able to ask it to undertake tasks like migrating code to nbdev, a platform for developing software in Jupyter Notebooks, and expect it to do so successfully. But that may be asking too much.

Early assessments of Devin have found problems. Cognition AI posted a promo video that supposedly showed the AI coder autonomously completing projects on the freelancer-for-hire platform Upwork. Software developer Carl Brown analyzed that vid and debunked it on his Internet of Bugs YouTube channel.

The software agent was also called out by another YouTube code pundit for allegedly including critical security issues.

Now, three data scientists affiliated with Answer.AI, an AI research and development lab founded by Jeremy Howard and Eric Ries, have tested Devin and found it completed just three out of 20 tasks successfully.

In an analysis conducted earlier this month by Hamel Husain, Isaac Flath, and Johno Whitaker, Devin started well, successfully pulling data from a Notion database into Google Sheets. The AI agent also managed to create a planet tracker for checking claims about the historical positions of Jupiter and Saturn.
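The report doesn't reproduce the code Devin generated, but the Notion-to-Sheets job is the kind of thing a developer could script by hand. The sketch below, using the notion-client and gspread packages, shows roughly what such a script involves; the database ID, credentials file, sheet name, and property names are placeholders, not details taken from the researchers' write-up.

```python
# Rough sketch of pulling rows from a Notion database into a Google Sheet.
# Uses the `notion-client` and `gspread` packages; the database ID, sheet
# name, and property names below are placeholders.
import os

import gspread
from notion_client import Client

notion = Client(auth=os.environ["NOTION_TOKEN"])
rows = notion.databases.query(database_id="YOUR_DATABASE_ID")["results"]

gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("Exported Notion data").sheet1

for row in rows:
    props = row["properties"]
    # Assumes the database has a "Name" title property and a "Status" select.
    name = props["Name"]["title"][0]["plain_text"] if props["Name"]["title"] else ""
    status = (props["Status"]["select"] or {}).get("name", "")
    worksheet.append_row([name, status])
```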

But as the three researchers continued their testing, they encountered problems.

“Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions,” the researchers explain in their report. “Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible.”

As an example, they cited how Devin, when asked to deploy multiple applications to the infrastructure deployment platform Railway, failed to understand this wasn’t supported and spent more than a day trying approaches that didn’t work and hallucinating non-existent features.

Of 20 tasks presented to Devin, the AI software engineer completed just three of them satisfactorily – the two cited above and a third challenge to research how to build a Discord bot in Python. Three other tasks produced inconclusive results, and 14 projects were outright failures.
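For reference, that Discord task was research rather than deployment; a minimal bot of the sort Devin was asked to investigate looks something like the discord.py sketch below, with the token environment variable and command name standing in as placeholders.

```python
# Minimal discord.py bot of the kind the research task describes.
# Requires the `discord.py` package and a DISCORD_TOKEN environment variable.
import os

import discord

intents = discord.Intents.default()
intents.message_content = True  # needed to read message text
client = discord.Client(intents=intents)

@client.event
async def on_ready():
    print(f"Logged in as {client.user}")

@client.event
async def on_message(message):
    if message.author == client.user:
        return  # ignore the bot's own messages
    if message.content.startswith("!ping"):
        await message.channel.send("pong")

client.run(os.environ["DISCORD_TOKEN"])
```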

The researchers said that Devin provided a polished user experience that was impressive when it worked.

“But that’s the problem – it rarely worked,” they wrote.

“More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability – Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers.”

Cognition AI did not respond to a request for comment. ®