OpenZeppelin says it found data contamination and methodological flaws in OpenAI’s EVMBench, a smart contract security benchmark built with Paradigm.
The security auditor also said the dataset contains invalid vulnerability classifications, including at least four issues labeled high-severity that are not exploitable in practice.
EVMBench launched as a smart contract security benchmark that tests whether AI agents can identify, patch, and exploit vulnerabilities in code.
OpenZeppelin said it reviewed the benchmark with the same approach it uses on real security work.
OpenZeppelin published its findings on March 2, 2026. It framed the review around two main problems: EVMBench data contamination and high-severity vulnerability classifications it considers invalid.

OpenZeppelin EVMBench data contamination and training data leakage concern
OpenZeppelin said the key skill in AI security is finding novel vulnerabilities in code a model has never seen; memorizing public reports does not demonstrate that skill. It added that top-scoring models likely saw EVMBench's underlying vulnerability reports during pretraining, and described that exposure as a form of training data leakage that can weaken benchmark signals.
EVMBench testing cut off internet access for the AI agents during evaluation. However, OpenZeppelin said the benchmark used curated vulnerabilities from 120 audits dated between 2024 and mid-2025, while many models' knowledge cutoffs also fall around mid-2025, raising the risk of overlap with pretraining data.
Invalid vulnerability classifications and at least four high-severity findings
OpenZeppelin said it found classification problems inside the dataset, which includes "invalid vulnerability classifications" — among them at least four findings labeled high-severity that are not exploitable in practice.
“We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin said.
OpenZeppelin said these cases are not a matter of severity debate: the described exploits simply do not work. Even so, it said, EVMBench scored AI agents as correct for finding these supposedly high-severity issues.
OpenAI Paradigm benchmark setup, model rankings, and why EVMBench testing may shift
EVMBench results ranked models based on how they handled smart contract vulnerability tasks. The published ranking listed Anthropic’s Claude Open 4.6 first, OpenAI’s OC GPT 5.2 second, and Google’s Gemini 3 Pro third.
OpenZeppelin said the dataset’s limited size makes contamination more important. It said a narrow evaluation surface can amplify the effect of any overlap with pretraining materials.
“While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test,” OpenZeppelin said. It tied that point directly to EVMBench data contamination and to how the benchmark measures smart contract security performance.
Tatevik Avetisyan is an editor at Kriptoworld who covers emerging crypto trends, blockchain innovation, and altcoin developments. She is passionate about breaking down complex stories for a global audience and making digital finance more accessible.
Published: March 3, 2026 • Last updated: March 3, 2026

