
Proof, Not Guesswork: An AI Audit Pipeline That Finds What Other Web3 Audits Miss

2026/03/03 14:07
6 min read

Over the past few months I’ve been building and running an AI pipeline that only reports what it can prove. On codebases that had already passed multiple audits, it still uncovered exploitable vulnerabilities. In one run alone, it surfaced eight reproducible issues, including High and Critical findings. That outcome is not luck. It is what a reproducible, multi-stage process produces: a short report and executable proof-of-concept files committed alongside the code. Every finding is backed by a runnable exploit that demonstrates the vulnerability in practice. Proof, not guesswork. In a sector where exploits are real and trust in “potential” findings is low, that bar is not optional.

Early on, I experimented with single-model scans and various AI audit tools. The pattern was consistent: long lists of potential issues, many false positives, and very little that could be demonstrated concretely. Closing the gap between “possible” and “provable” became the goal. The current pipeline grew out of that frustration with maybe-lists and unverified claims.

This is not a replacement for a formal audit. It is an additional, reproducible second opinion at the code level, a code review at scale. You still need a full human audit before mainnet. A reproducible, exploit-backed second pass should be standard practice for code that keeps evolving. This is the pass I run when I want to know what slipped through, or what appeared after the last audit.

What I Built (And Kept Iterating On)

What I built is a pipeline that produces a report and executable proof-of-concept files in the repository. Every finding includes a runnable exploit that demonstrates the issue. If it is in the report, there is a concrete proof-of-concept that shows how it can be triggered.

The filter is not heuristic. Every dropped finding receives a documented rejection reason: factually wrong, no valid attack path, design choice, duplicate, or out of scope. That is a structured quality gate, not gut feel. Only findings with a runnable, non-trivial proof-of-concept survive to the final report. When severity sources disagree, the lower severity is used.
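As a rough sketch of what such a gate could look like, the snippet below models the documented rejection reasons, the conservative severity rule, and the PoC requirement described above. All names and the finding schema are illustrative assumptions, not the actual implementation.

```python
from enum import Enum

class Rejection(Enum):
    """Documented reasons for dropping a finding."""
    FACTUALLY_WRONG = "factually wrong"
    NO_ATTACK_PATH = "no valid attack path"
    DESIGN_CHOICE = "design choice"
    DUPLICATE = "duplicate"
    OUT_OF_SCOPE = "out of scope"

# Severity ladder, lowest first; used for conservative reconciliation.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]

def reconcile_severity(a: str, b: str) -> str:
    """When two severity sources disagree, keep the lower of the two."""
    return min(a, b, key=SEVERITY_ORDER.index)

def quality_gate(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Keep only findings with a runnable PoC and no rejection reason."""
    kept = [f for f in findings if f.get("poc_runs") and not f.get("rejection")]
    dropped = [f for f in findings if f not in kept]
    return kept, dropped
```

The point of the enum is that every dropped candidate carries one of a fixed set of reasons, so rejections are auditable rather than silent.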

When there is a prior audit report, such as a PDF, I ingest it so the pipeline does not re-flag already reported issues. The run is self-contained and sets up the test environment, such as Foundry, if needed. Output is structured to fit how teams and bug bounty platforms operate.
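One plausible way to avoid re-flagging prior findings is to fingerprint each issue and filter candidates against the fingerprints extracted from the earlier report. This is a minimal sketch under that assumption; the fingerprint fields are hypothetical, not the pipeline's real schema.

```python
import hashlib

def fingerprint(contract: str, function: str, issue_class: str) -> str:
    """Stable identifier for a finding; the fields are illustrative only."""
    raw = f"{contract}:{function}:{issue_class}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

def filter_known(candidates: list[dict], prior: set[str]) -> list[dict]:
    """Drop candidates whose fingerprint matches a prior audit report."""
    return [c for c in candidates if fingerprint(**c) not in prior]
```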

This runs as a distinct service with its own scope and pricing, positioned between single-model scans and full audits. Full human audits typically range from $50k to $200k or more. This sits in between: repeatable, proof-driven, and scoped to code-level risk.

Scale and Models: What I’ve Been Running

Runs and codebases: Over the past few months I have executed dozens of full pipeline runs across more than 15 codebases. Many had already been audited by two or more teams. This is not a handful of reviews, but repeated application across real projects. The pipeline has been calibrated over many real runs; the current configuration is the result of continuous refinement, not a one-off design.

Getting to this point required substantial experimentation across models and configurations, as well as real spend. The current setup reflects what held up under adversarial challenge and what consistently produced reproducible results.

Explorers per run: Each run executes 8 to 10 explorer agents in parallel. A typical run takes several hours and produces 40 or more candidate findings before deduplication and challenge. The funnel reduces that to a final report in the single digits or low teens. The candidate-to-report ratio is often 3:1 or 4:1.

One run produced 11 findings with executable proof-of-concepts in the final report, including 3 High, 5 Medium, and 3 Low. Another produced 10 findings with proof-of-concepts, 2 overlapping with prior audits, and 8 new findings including High and Critical. Most of that report was new to the client.

Models in the mix: I do not rely on a single model. Multiple model families are used for exploration, challenge, and proof-of-concept construction. The mix is not trial and error. I track which models contribute unique High findings, which generate mostly noise, and which hold up under adversarial challenge. Model selection has been refined over many runs; the current set remains because the data supports it. Remove one and you can miss a finding. Retries and fallbacks ensure that each run completes even if a step encounters issues.
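The retry-and-fallback behavior mentioned above could be sketched as follows: try each model in order and retry transient failures so a run completes even when one step fails. The function and model names are assumptions for illustration, not the actual orchestration code.

```python
import time

def run_with_fallback(step, models, retries=2, delay=0.0):
    """Call step(model) for each model in order, retrying transient failures."""
    last_err = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return step(model)
            except Exception as err:
                last_err = err
                time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"all models failed; last error: {last_err}")
```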

Why a Second Pass Finds Things

Human audits are finite. Edge cases slip through. Code changes. Refactors introduce new paths. A single audit is a snapshot; code is a moving system. Relying on one pass is comfortable, but risky when the attack surface is large.

A second pass does not imply the first audit was poor. It acknowledges that coverage is difficult.

The pipeline is not just another opinion. It works differently: multiple independent exploration paths, an adversarial challenger that questions every finding, conservative deduplication rules, and explicit validation where only issues backed by a working proof-of-concept survive. That is a methodologically different perspective.

Explorers perform structured analysis across protocol design, logic, economics, and attack paths. Findings are merged and deduplicated. A challenger attempts to invalidate them. Proof-of-concepts are built and executed in your framework, such as Foundry or Hardhat. Only findings backed by a working exploit remain in the final report. Everything rejected is logged with a documented reason.
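The staged funnel described above can be sketched as a simple driver: explore independently, merge and dedupe, challenge, then require an executing proof-of-concept. This is a schematic under assumed interfaces, not the real pipeline, which the author notes has many more stages.

```python
def run_pipeline(explorers, dedupe, challenge, build_poc):
    """Staged funnel: explore, merge/dedupe, challenge, prove, report."""
    candidates = [f for explore in explorers for f in explore()]  # parallel in practice
    merged = dedupe(candidates)                                   # conservative dedup
    report, rejected = [], []
    for finding in merged:
        if not challenge(finding):                 # challenger invalidated it
            rejected.append((finding, "no valid attack path"))
            continue
        if not build_poc(finding):                 # PoC must execute successfully
            rejected.append((finding, "not reproducible"))
            continue
        report.append(finding)
    return report, rejected                        # rejections keep their reasons
```

Note that every rejected finding leaves the funnel with a logged reason, mirroring the documented-rejection rule from the quality gate.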

The real pipeline includes significantly more stages than this summary suggests. Behind each stage are validation and control mechanisms. It is staged quality assurance, not a linear AI workflow. A full run spans multiple structured analysis and validation phases, emphasizing depth rather than runtime.

The Numbers

Dozens of candidates per run are systematically reduced to a small, defensible set. Only findings backed by runnable proof-of-concepts survive. That is what it takes to get to proof instead of guesswork.

If it is in the report, there is a working proof-of-concept demonstrating it. That is the bar.

When It Fits

Pre-launch as an additional pass. After a refactor or upgrade. Following governance or tokenomics changes. Before a raise or external audit. Narrow scopes such as a single module or integration.

The same fixed categories are applied every run. I do not publish the list. What ultimately appears in the report depends on the specific attack surface of your codebase.

What Kind of Hardening Shows Up

Across dozens of runs, certain patterns consistently surface. Governance parameters that can be zero or invalid on production paths. Withdrawal and queue logic that cannot progress under loss scenarios. First-depositor and reward front-running in share-based systems. Withdrawal flows that stall when one market becomes illiquid.

These are recurring classes of issues that appear under structured analysis and adversarial challenge. If your system has similar surface area, that is the kind of coverage this pipeline is designed to stress.

If a reproducible, exploit-backed second pass makes sense for your codebase, you know where to find me. Telegram: @Kurt0x


Proof, Not Guesswork: An AI Audit Pipeline That Finds What Other Web3 Audits Miss was originally published in Coinmonks on Medium.

