OpenZeppelin Raises a Red Flag on OpenAI’s EVMBench Data


OpenZeppelin says it found data contamination and methodological flaws in OpenAI’s EVMBench, a smart contract security benchmark built with Paradigm.

The security auditor also said the dataset includes invalid vulnerability classifications, including at least four issues labeled high-severity that are not exploitable in practice.


EVMBench launched as a smart contract security benchmark that tests whether AI agents can identify, patch, and exploit vulnerabilities in code.

OpenZeppelin said it reviewed the benchmark with the same approach it uses on real security work.

OpenZeppelin published its findings on March 2, 2026. It framed the review around two main problems: data contamination in EVMBench and high-severity vulnerability classifications that it considers invalid.

OpenZeppelin EVMBench Audit Announcement. Source: OpenZeppelin on X

OpenZeppelin flags EVMBench data contamination and training data leakage

OpenZeppelin said the key skill in AI security is finding novel vulnerabilities in code a model has never seen. It said memorizing public reports does not show that skill.

OpenZeppelin also said top scoring models likely saw EVMBench vulnerability reports during pretraining.

It described that exposure as a form of training data leakage that can weaken benchmark signals.

EVMBench testing cut off internet access for the AI agents.

However, OpenZeppelin said the benchmark used curated vulnerabilities from 120 audits dated between 2024 and mid-2025. It said many model knowledge cutoffs also sit around mid-2025, which raises the risk that audit reports overlap with pretraining data.
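The date overlap described here can be illustrated with a minimal sketch. It assumes a simple rule, that any report published on or before a model's knowledge cutoff could have appeared in its pretraining data. The function name, report names, and all dates are hypothetical and are not taken from EVMBench or from OpenZeppelin's review.

```python
from datetime import date


def possibly_contaminated(report_published: date, model_cutoff: date) -> bool:
    """Return True if the report predates (or matches) the model's knowledge cutoff.

    Assumption: publication before the cutoff only means exposure is possible,
    not that the model actually saw or memorized the report.
    """
    return report_published <= model_cutoff


# Hypothetical values for illustration only.
model_cutoff = date(2025, 6, 30)
reports = {
    "audit-a": date(2024, 3, 15),
    "audit-b": date(2025, 5, 1),
    "audit-c": date(2025, 9, 10),
}

# Collect the reports that fall inside the model's possible training window.
flagged = sorted(
    name
    for name, published in reports.items()
    if possibly_contaminated(published, model_cutoff)
)
print(flagged)
```

With these made-up dates, the two reports published before the cutoff are flagged and the later one is not. Cutting off internet access during testing does not change this result, because the exposure would have happened during pretraining.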

Invalid vulnerability classifications, including at least four high-severity findings

OpenZeppelin said it also found classification problems inside the dataset itself.

“We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin said.

OpenZeppelin said these cases are not a matter of debate over severity. In each case, it said, the described exploit does not work.

It also said EVMBench still scored AI agents as correct for finding these supposedly high-severity issues.

OpenAI Paradigm benchmark setup, model rankings, and why EVMBench testing may shift

EVMBench results ranked models based on how they handled smart contract vulnerability tasks. The published ranking listed Anthropic’s Claude Opus 4.6 first, OpenAI’s OC GPT 5.2 second, and Google’s Gemini 3 Pro third.

OpenZeppelin said the dataset’s limited size makes contamination more important. It said a narrow evaluation surface can amplify the effect of any overlap with pretraining materials.

“While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test,” OpenZeppelin said. It tied that point directly to EVMBench data contamination and to how the benchmark measures smart contract security performance.


Disclosure: This article does not contain investment advice or recommendations. Every investment and trading move involves risk, and readers should conduct their own research when making a decision.

Kriptoworld.com accepts no liability for any errors in the articles or for any financial loss resulting from incorrect information.

Tatevik Avetisyan
Editor at Kriptoworld

Tatevik Avetisyan is an editor at Kriptoworld who covers emerging crypto trends, blockchain innovation, and altcoin developments. She is passionate about breaking down complex stories for a global audience and making digital finance more accessible.

📅 Published: March 3, 2026 • 🕓 Last updated: March 3, 2026
