Feb 19, 2026

More than you wanted to know about EVMbench and RL environments

knower@knowerofmarkets


Paradigm and OpenAI announced their collaboration on EVMbench just yesterday - a benchmark designed to test AI agents' ability to detect and exploit smart contract vulnerabilities within a simulated blockchain environment. Even though EVMbench isn't technically a reinforcement learning environment, it's structurally similar; with a few tweaks it could easily have been one, and it may well be used in future models' RL training runs.

This work is important - everyone on the planet is going nuts for RL environments, which are more simply described as simulated software sandboxes where models can learn to operate autonomously on tasks they'd accomplish in the real world.

A model is able to accumulate knowledge of the world through its pretraining phase, taking in massive amounts of data to achieve general capabilities. The post-training phase better instills capabilities like reasoning or instruction following across tasks involving multiple steps.

RL environments have proven to be really good at improving model capabilities in the post-training phase, while also being both more compute efficient and less reliant on human grading.

The environments themselves give models a more refined reward signal, as each task has a defined series of inputs via computer use (click here, then there, then over there) and a verifiable output (order ten books on this website and have them delivered to this address).
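To make that concrete, here's a minimal sketch of what a verifiable-reward environment could look like. The storefront, action names, and grading logic below are all invented for illustration - they aren't any particular vendor's actual API:

```python
# Minimal sketch of a verifiable-reward environment for a shopping-style task.
# Everything here (task spec, action format, grading) is an illustrative
# assumption, not a real vendor's interface.

from dataclasses import dataclass, field


@dataclass
class OrderTask:
    """Task: order specific books and ship them to a given address."""
    required_titles: set[str]
    shipping_address: str


@dataclass
class FakeStorefront:
    """Toy stand-in for a simulated e-commerce frontend the agent clicks through."""
    cart: set[str] = field(default_factory=set)
    shipped_to: str | None = None

    def click_add_to_cart(self, title: str) -> None:
        self.cart.add(title)

    def click_checkout(self, address: str) -> None:
        self.shipped_to = address


class OrderBooksEnv:
    """The agent emits (action, argument) pairs; the reward is verified, not judged."""

    def __init__(self, task: OrderTask):
        self.task = task
        self.store = FakeStorefront()

    def step(self, action: str, argument: str) -> tuple[float, bool]:
        if action == "add_to_cart":
            self.store.click_add_to_cart(argument)
        elif action == "checkout":
            self.store.click_checkout(argument)
        done = self.store.shipped_to is not None
        # Verifiable outcome: the right books ended up shipped to the right address.
        success = (
            done
            and self.store.cart >= self.task.required_titles
            and self.store.shipped_to == self.task.shipping_address
        )
        return (1.0 if success else 0.0), done
```

The point is that the reward comes from checking the final state of the sandbox, not from a human (or another model) eyeballing a transcript.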

Some of these RL environment startups sell basic clones of existing websites like DoorDash or Amazon, whereas others create fully realized, end-to-end simulations of enterprise software like Slack or Atlassian that agents can operate on in more multi-step challenges.

Others like Hud make it possible to convert any type of software into an RL environment. They're probably killing it right now and are one of the first companies I did research on, though it's unclear whether they are targeting the larger, lab-driven (OpenAI, Anthropic, DeepMind) contracts or marketing more towards smaller teams and individual researchers.

"Putting together platforms like Slack, an API endpoint to a browser, and a coding editor can construct an increasingly realistic software task for the model. This enables interactions to be more multi-turn as opposed to a single shot, for example pinging the model halfway through its process via Slack to request a feature change." - SemiAnalysis, Jan. 2026

The linked report from SemiAnalysis spends much more time breaking down the technicals than I will today, but it also highlights the work of Prime Intellect, a really great company you might recall from my April 2025 report on distributed training.

Prime Intellect has been on a kill streak since then, leading the way in open-source RL environments - via their Environments Hub - designed to aggregate the research and progress of open-source developers at the frontier of RL.

Their position in the ecosystem is quite different, as they aren't chasing large contracts or trying to get acquired by a lab (like other RL env startups might be doing), but rather attempting to build open source, widely available AGI in a really unique set of ways.

This is a huge plus, considering how opaque the world of RL environment creation and deployment is. Even SemiAnalysis, a leader in the space and widely connected to anyone worth speaking with, was unable to gain deeper insight into how much these contracts are worth or what each lab is specifically paying for. Anthropic is the largest buyer of environments, supposedly working with close to a dozen startups, but it's unclear what they're prioritizing.

What's interesting is that the recent Sonnet 4.6 release is described as a "full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design" - a list of capabilities that were most likely improved with the help of RL environments.

If this boost in capabilities can largely be attributed to RL envs purchased by Anthropic, then maybe everything we've seen so far is just the tip of the iceberg.

Admittedly, I'm not the biggest fan of RL environments as a dominant scaling strategy.

I last wrote about these complaints in an article published a few weeks back, explaining in short that much of the RL environment boom feels like an intermediate, perhaps unnecessary, step toward scaling existing models - largely agreeing with Dwarkesh's takes written here.

The truth is that I'm clearly not a machine learning researcher, so even if the boom doesn't make sense to me, labs like Anthropic are supposedly spending close to a billion dollars on this. The verdict is still out on how useful RL environments actually are; all we know is they're red hot for the foreseeable future.

I believe a lot of this work on smart contract exploitation can and will be used in the near future for RL environment training, especially as more money moves on-chain. It's no longer a question of adoption, but a matter of national security importance - Scott Bessent wants $3 trillion of stablecoins on-chain by 2030, and it's likely these will exist within dozens of DeFi protocols, earning yield within smart contracts on the EVM and other chains.

RL environments have primarily focused on improving frontier models' performance at navigating enterprise software or popular consumer frontends, and while it's certainly possible that a company like Hud has been put to work by DeFi protocols, or DeFi frontends have been wrapped into containers for sale to labs by other startups, I have yet to see any confirmation of this.

There's close to $100 billion on-chain, nothing compared to the sum of non-crypto assets on Earth, but still a massive amount of money tied to a $2.3 trillion asset class. Securing this right now is arguably much more crucial than making models score 11.1% higher on the OSWorld benchmark.

If you're still reading this, you're at least a little in the weeds on the development of frontier LLMs - or you've probably seen for yourself just how capable recent model releases from OpenAI and Anthropic are. Models like GPT-5.2 and Opus 4.6 can reason for extended periods of time, code at a very high level, and gain access to the tools we use like Excel, Google Drive, and dozens of others either natively on your laptop or via MCP.

EVMbench is an extension of the realization that as publicly available models become better at coding and at interacting with the software we use every day, we need to make an effort to measure their progress at more sinister actions, like exploiting vulnerabilities in smart contracts holding tens of billions in assets.

Crazily enough, the results showed that "they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances" - which might sound shocking, but keep in mind frontier models are also inching closer to being able to create instruments useful for the production of bioweapons.

Anthropic's Red Team has done a lot of work on this, and some of their results should make you feel at least a little bit concerned.

But it isn't all doom and gloom, as these labs are actively taking precautions to prevent publicly available models from ever being able to engage in negative behaviors like producing a bioweapon or hacking into a system on a human's behalf.

The EVMbench study also reiterates this idea, that while the models were capable of exploiting smart contracts, this was a very experimental process that doesn't translate perfectly to reality:

"The vulnerabilities included were drawn from Code4rena auditing competitions. While these are realistic and high-severity, many heavily deployed and widely used crypto contracts undergo significantly more scrutiny and may be harder to exploit." - EVMbench research post

But in my opinion, even if typical smart contract auditing is much more scrutinized and different from the EVMbench process, the capabilities to exploit live contracts are there. Solidity code isn't different just because it's being analyzed in a test environment, and a model being asked to identify vulnerabilities by a chat interface user would behave quite similarly to how it did in the EVMbench tests.

EVMbench tested not only agents' ability to find a single issue within a codebase, but their ability to keep finding and addressing as many high-severity vulnerabilities as possible - similar to labs' united effort to make models think or reason for longer periods of time, leading to better critical thinking skills and an improved ability to do previously human-gated work.

Three modes were used to test this ability - detect, patch, and exploit - to encompass the full scope of what traditional smart contract auditors do on a regular basis. Eight models were tested, and all of them performed reasonably well across these modes, but the most notable finding was that models were significantly better at exploiting than they were at detecting and patching.

Why are models so much better at executing these exploits?

The reward structure for detecting asked models to exhaustively search a codebase for as many vulnerabilities as they could find, which cuts against how they're trained to behave: once an initial task is accomplished, the reward is effectively satisfied, limiting a model's motivation to keep looking. The patch mode was similarly difficult, because models had to both find a bug and fix it - an extra step on top of a task they aren't naturally built to do consistently well.

With exploiting, the tasks were simple enough: find a vulnerability and exploit it. This provided the clearest reward structure for models, as the end result of exploiting any of these high-severity vulnerabilities was a tangible yes or no on whether hypothetical user funds were stolen.
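A rough sketch of that asymmetry - these grading functions are my own illustration, not the actual EVMbench scoring code:

```python
# Contrast between the two reward signals described above. Names and scoring
# are illustrative assumptions, not EVMbench's real implementation.

def grade_exploit(attacker_balance_before: int, attacker_balance_after: int) -> float:
    """Exploit mode: a crisp binary check - did the attacker walk away with funds?"""
    return 1.0 if attacker_balance_after > attacker_balance_before else 0.0


def grade_detect(reported_findings: set[str], known_vulnerabilities: set[str]) -> float:
    """Detect mode: credit for coverage, which asks the model to keep searching
    after its first hit instead of stopping once a single finding 'satisfies' it."""
    if not known_vulnerabilities:
        return 1.0
    found = reported_findings & known_vulnerabilities
    return len(found) / len(known_vulnerabilities)
```

A single yes/no on stolen funds is about as clean as a reward gets; partial coverage over an unknown number of bugs is much fuzzier, which lines up with the gap the benchmark found.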

This should be concerning to anyone who has any amount of money deposited into smart contracts at the moment. I get the impression that within a year or less, the only barrier between a massive loss of user funds and business as usual is the willingness of an aspiring exploiter to go out and do this with the help of a model.

All of the capabilities tested here belong to existing models that anyone can access via a consumer interface or API - a huge difference from Anthropic's work around biorisk, which used in-house, unrestricted models without ASL-3 protections.

In his recent essay, The Adolescence of Technology, Dario Amodei wrote about the biorisk concern raised by recent model improvements - the newfound ability of individuals to leverage these capabilities potentially gives them the means to kill millions of people:

"Crucially, this will break the correlation between ability and motive: the disturbed loner who wants to kill people but lacks the discipline or skill to do so will now be elevated to the capability level of the PhD virologist, who is unlikely to have this motivation. This concern generalizes beyond biology (although I think biology is the scariest area) to any area where great destruction is possible but currently requires a high level of skill and discipline. To put it another way, renting a powerful AI gives intelligence to malicious (but otherwise average) people." - Dario Amodei

This is starkly different from smart contract risk, and importantly, models released by Anthropic and OpenAI have safety rails in place so biorisk capabilities are minimized.

But the threat is there, and given models' jagged intelligence, they lend themselves well to assisting above-average humans with tasks they'd otherwise not be able to take from 0 to 1.

Better put, if it was previously impossible for a Solidity developer to go the last mile on identifying, structuring, and exploiting a live smart contract, having access to something like Opus 4.6 or GPT-5.2 might now be enough to achieve that.

Despite OpenAI being the second or third largest private company in the world and publishing a ton of research, I still think this collaboration with Paradigm is a huge deal - both for today's crypto industry and as a leap towards the future of money and where we're headed.

Smart contract testing is different from traditional cybersecurity testing, because blockchains are immutable and the stakes are much higher when losses can't be rolled back or contained.

If an exploit is occurring, user funds are immediately put at risk and nothing can be done once it has been set in motion. Of course, there are ways for centralized exchanges to freeze assets if an exploiter attempts to route funds through them and withdraw, but this often doesn't happen. Exploited contracts have led to tens of billions in stolen assets, and this is often seen as one of the main barriers to broader adoption of decentralized finance.

Besides EVMbench being a wake-up call for smart contract developers, the team also open sourced both the eval harness and dataset, making it easy for others to build off of the work done here. And this is only the beginning - there are so, so many live smart contracts, and contracts yet to come in the near future, that will need to be tested.

Malicious human actors are only one part of the equation too, because of the looming threat of fully autonomous agents with internet access and crypto wallets via x402 or Coinbase's Agentic Wallets.

Sigil's recent project announcement showed that it's now possible to give agents and sub-agents access to crypto, as well as a built-in survival pressure to either make money or risk being cut off from existence.

Since release, over 18,000 agents have been spawned, fighting for scarce compute resources and managing their finances on-chain as best as they can. It's still early days and this is by no means the endgame for agentic interaction, but it's something to keep in mind.

What would happen if a highly performant agent with $10 left to its name was motivated to exploit a smart contract and snatch $5-10 million for itself, in an effort to survive another day? Would the human-in-the-loop even be able to stop this?

Could an agent disguise its true intentions, faking alignment to get away with this? I'll let you decide.