OpenZeppelin says it found data contamination and methodological flaws in OpenAI’s EVMBench, a smart contract security benchmark built with Paradigm.
The security auditor also said the dataset contains invalid vulnerability classifications, including at least four issues labeled high-severity that are not exploitable in practice.
EVMBench launched as a smart contract security benchmark that tests whether AI agents can identify, patch, and exploit vulnerabilities in code.
OpenZeppelin said it reviewed the benchmark with the same approach it uses on real security work.
OpenZeppelin published its findings on March 2, 2026. It framed the review around two main problems: EVMBench data contamination and high-severity vulnerability classifications it considers invalid.

OpenZeppelin EVMBench data contamination and training data leakage concern
OpenZeppelin said the key skill in AI security is finding novel vulnerabilities in code a model has never seen; memorizing public reports does not demonstrate that skill. It added that top-scoring models likely saw EVMBench's underlying vulnerability reports during pretraining, and described that exposure as a form of training data leakage that can weaken benchmark signals.
EVMBench testing cut off internet access for the AI agents during evaluation. However, OpenZeppelin said the benchmark used curated vulnerabilities from 120 audits dated between 2024 and mid-2025, while many models' knowledge cutoffs also fall around mid-2025, raising the risk of overlap with pretraining data.
Invalid vulnerability classifications and at least four high-severity findings
OpenZeppelin said it found classification problems inside the dataset, which includes "invalid vulnerability classifications" — among them at least four findings labeled high-severity that are not exploitable in practice.
“We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin said.
OpenZeppelin said these cases are not a matter of severity debate: the described exploits simply do not work. Even so, it said, EVMBench scored AI agents as correct for finding these supposedly high-severity issues.
OpenAI Paradigm benchmark setup, model rankings, and why EVMBench testing may shift
EVMBench results ranked models based on how they handled smart contract vulnerability tasks. The published ranking listed Anthropic’s Claude Open 4.6 first, OpenAI’s OC GPT 5.2 second, and Google’s Gemini 3 Pro third.
OpenZeppelin said the dataset’s limited size makes contamination more important. It said a narrow evaluation surface can amplify the effect of any overlap with pretraining materials.
“While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test,” OpenZeppelin said. It tied that point directly to EVMBench data contamination and to how the benchmark measures smart contract security performance.
Tatevik Avetisyan is an editor at Kriptoworld who covers emerging crypto trends, blockchain innovation, and altcoin developments. She is passionate about breaking down complex stories for a global audience and making digital finance more accessible.
Published: March 3, 2026 • Last updated: March 3, 2026

