OpenZeppelin Raises a Red Flag on OpenAI’s EVMBench Data


OpenZeppelin says it found data contamination and methodological flaws in OpenAI’s EVMBench, a smart contract security benchmark built with Paradigm.

The security auditor also said the dataset includes invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice.


EVMBench launched as a smart contract security benchmark that tests whether AI agents can identify, patch, and exploit vulnerabilities in code.

OpenZeppelin said it reviewed the benchmark with the same approach it uses on real security work.

OpenZeppelin published its findings on March 2, 2026. It framed the review around two main problems: EVMBench data contamination and high severity vulnerability classifications it considers invalid.

OpenZeppelin EVMBench Audit Announcement. Source: OpenZeppelin on X

OpenZeppelin EVMBench data contamination and training data leakage concern

OpenZeppelin said the key skill in AI security is finding novel vulnerabilities in code a model has never seen. It said memorizing public reports does not show that skill.

OpenZeppelin also said top scoring models likely saw EVMBench vulnerability reports during pretraining.

It described that exposure as a form of training data leakage that can weaken benchmark signals.

EVMBench testing cut off internet access for the AI agents.

However, OpenZeppelin said the benchmark used curated vulnerabilities from 120 audits dated between 2024 and mid-2025. It said many model knowledge cutoffs also sit around mid-2025, so the audit reports could have been included in pretraining data even with internet access disabled during testing.
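The date overlap described above can be illustrated with a short sketch. It is not OpenZeppelin's or OpenAI's method; the audit records, the `possibly_seen` helper, and the cutoff date are all hypothetical, chosen only to show how an audit published on or before a model's knowledge cutoff would be flagged as potentially seen.

```python
from datetime import date

# Hypothetical audit records: (audit id, publication date).
# Illustrative values only, not entries from the EVMBench dataset.
AUDITS = [
    ("audit-a", date(2024, 3, 1)),
    ("audit-b", date(2025, 1, 15)),
    ("audit-c", date(2025, 9, 1)),
]

def possibly_seen(audits, knowledge_cutoff):
    """Return ids of audits published on or before the model's knowledge cutoff."""
    return [audit_id for audit_id, published in audits if published <= knowledge_cutoff]

# Assumed cutoff of mid-2025, matching the rough figure cited in the article.
flagged = possibly_seen(AUDITS, date(2025, 6, 30))
print(flagged)
```

A publication date before the cutoff does not prove a report was in the training data. It only marks the cases where leakage cannot be ruled out.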

Invalid vulnerability classifications and at least four high severity vulnerabilities

OpenZeppelin said it found classification problems inside the dataset. It said the dataset includes “invalid vulnerability classifications,” including at least four findings labeled high severity that are not exploitable in practice.

“We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin said.

OpenZeppelin said these cases are not a matter of debate over severity ratings. It said the described exploits do not work.

It also said EVMBench still scored AI agents as correct for finding these supposedly high severity issues.
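To see why this matters for scores, consider a minimal sketch of a detection rate computed with and without invalid findings in the answer key. The article does not describe EVMBench's actual scoring formula, so the `detection_rate` function, the finding ids, and the numbers below are hypothetical.

```python
def detection_rate(found, ground_truth, invalid=frozenset()):
    """Share of valid ground-truth findings that the agent reported.

    Findings listed in `invalid` are removed from the answer key first,
    so reporting them earns no credit.
    """
    valid = set(ground_truth) - set(invalid)
    if not valid:
        return 0.0
    return len(set(found) & valid) / len(valid)

# Hypothetical answer key with five high-severity findings,
# two of which are assumed to be non-exploitable.
ground_truth = {"H-1", "H-2", "H-3", "H-4", "H-5"}
invalid = {"H-4", "H-5"}
found = {"H-1", "H-4", "H-5"}

raw = detection_rate(found, ground_truth)                 # invalid findings count
adjusted = detection_rate(found, ground_truth, invalid)   # invalid findings excluded
print(raw, adjusted)
```

In this made-up case the agent's score falls from 3 of 5 to 1 of 3 once the two invalid findings are dropped, which shows how a flawed answer key can reward reports of exploits that do not work.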

OpenAI Paradigm benchmark setup, model rankings, and why EVMBench testing may shift

EVMBench results ranked models based on how they handled smart contract vulnerability tasks. The published ranking listed Anthropic’s Claude Opus 4.6 first, OpenAI’s OC GPT 5.2 second, and Google’s Gemini 3 Pro third.

OpenZeppelin said the dataset’s limited size makes contamination more important. It said a narrow evaluation surface can amplify the effect of any overlap with pretraining materials.

“While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test,” OpenZeppelin said. It tied that point directly to EVMBench data contamination and to how the benchmark measures smart contract security performance.


Disclosure: This article does not contain investment advice or recommendations. Every investment and trading move involves risk, and readers should conduct their own research when making a decision.

Kriptoworld.com accepts no liability for any errors in the articles or for any financial loss resulting from incorrect information.

Tatevik Avetisyan
Editor at Kriptoworld

Tatevik Avetisyan is an editor at Kriptoworld who covers emerging crypto trends, blockchain innovation, and altcoin developments. She is passionate about breaking down complex stories for a global audience and making digital finance more accessible.

📅 Published: March 3, 2026 • 🕓 Last updated: March 3, 2026
