
Proof, Not Guesswork: An AI Audit Pipeline That Finds What Other Web3 Audits Miss

2026/03/03 14:07
6 min read

Over the past few months I’ve been building and running an AI pipeline that only reports what it can prove. On codebases that had already passed multiple audits, it still uncovered exploitable vulnerabilities. In one run alone, it surfaced eight reproducible issues, including High and Critical findings. That outcome is not luck. It is what a reproducible, multi-stage process produces: a short report and executable proof-of-concept files committed alongside the code. Every finding is backed by a runnable exploit that demonstrates the vulnerability in practice. Proof, not guesswork. In a sector where exploits are real and trust in “potential” findings is low, that bar is not optional.

Early on, I experimented with single-model scans and various AI audit tools. The pattern was consistent: long lists of potential issues, many false positives, and very little that could be demonstrated concretely. Closing the gap between “possible” and “provable” became the goal. The current pipeline grew out of that frustration with maybe-lists and unverified claims.

This is not a replacement for a formal audit. It is an additional, reproducible second opinion at the code level, a code review at scale. You still need a full human audit before mainnet. A reproducible, exploit-backed second pass should be standard practice for code that keeps evolving. This is the pass I run when I want to know what slipped through, or what appeared after the last audit.

What I Built (And Kept Iterating On)

What I built is a pipeline that produces a report and executable proof-of-concept files in the repository. Every finding includes a runnable exploit that demonstrates the issue. If it is in the report, there is a concrete proof-of-concept that shows how it can be triggered.

The filter is not heuristic. Every dropped finding receives a documented rejection reason: factually wrong, no valid attack path, design choice, duplicate, or out of scope. That is a structured quality gate, not gut feel. Only findings with a runnable, non-trivial proof-of-concept survive to the final report. When severity sources disagree, the lower severity is used.
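As a rough sketch of what such a gate could look like, the snippet below models the documented rejection reasons, the conservative severity rule, and the PoC requirement described above. All names and the finding schema are illustrative assumptions, not the actual implementation.

```python
from enum import Enum

class Rejection(Enum):
    """Documented reasons for dropping a finding."""
    FACTUALLY_WRONG = "factually wrong"
    NO_ATTACK_PATH = "no valid attack path"
    DESIGN_CHOICE = "design choice"
    DUPLICATE = "duplicate"
    OUT_OF_SCOPE = "out of scope"

# Severity ladder, lowest first; used for conservative reconciliation.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]

def reconcile_severity(a: str, b: str) -> str:
    """When two severity sources disagree, keep the lower of the two."""
    return min(a, b, key=SEVERITY_ORDER.index)

def quality_gate(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Keep only findings with a runnable PoC and no rejection reason."""
    kept = [f for f in findings if f.get("poc_runs") and not f.get("rejection")]
    dropped = [f for f in findings if f not in kept]
    return kept, dropped
```

The point of the enum is that every dropped candidate carries one of a fixed set of reasons, so rejections are auditable rather than silent.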

When there is a prior audit report, such as a PDF, I ingest it so the pipeline does not re-flag already reported issues. The run is self-contained and sets up the test environment, such as Foundry, if needed. Output is structured to fit how teams and bug bounty platforms operate.
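One plausible way to avoid re-flagging prior findings is to fingerprint each issue and filter candidates against the fingerprints extracted from the earlier report. This is a minimal sketch under that assumption; the fingerprint fields are hypothetical, not the pipeline's real schema.

```python
import hashlib

def fingerprint(contract: str, function: str, issue_class: str) -> str:
    """Stable identifier for a finding; the fields are illustrative only."""
    raw = f"{contract}:{function}:{issue_class}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

def filter_known(candidates: list[dict], prior: set[str]) -> list[dict]:
    """Drop candidates whose fingerprint matches a prior audit report."""
    return [c for c in candidates if fingerprint(**c) not in prior]
```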

This runs as a distinct service with its own scope and pricing, positioned between single-model scans and full audits. Full human audits typically range from $50k to $200k or more. This sits in between: repeatable, proof-driven, and scoped to code-level risk.

Scale and Models: What I’ve Been Running

Runs and codebases: Over the past few months I have executed dozens of full pipeline runs across more than 15 codebases. Many had already been audited by two or more teams. This is not a handful of reviews, but repeated application across real projects. The pipeline has been calibrated over many real runs; the current configuration is the result of continuous refinement, not a one-off design.

Getting to this point required substantial experimentation across models and configurations, as well as real spend. The current setup reflects what held up under adversarial challenge and what consistently produced reproducible results.

Explorers per run: Each run executes 8 to 10 explorer agents in parallel. A typical run takes several hours and produces 40 or more candidate findings before deduplication and challenge. The funnel reduces that to a final report in the single digits or low teens. The candidate-to-report ratio is often 3:1 or 4:1.

One run produced 11 findings with executable proof-of-concepts in the final report, including 3 High, 5 Medium, and 3 Low. Another produced 10 findings with proof-of-concepts, 2 overlapping with prior audits, and 8 new findings including High and Critical. Most of that report was new to the client.

Models in the mix: I do not rely on a single model. Multiple model families are used for exploration, challenge, and proof-of-concept construction. The mix is not trial and error. I track which models contribute unique High findings, which generate mostly noise, and which hold up under adversarial challenge. Model selection has been refined over many runs; the current set remains because the data supports it. Remove one and you can miss a finding. Retries and fallbacks ensure that each run completes even if a step encounters issues.
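The retry-and-fallback behavior mentioned above could be sketched as follows: try each model in order and retry transient failures so a run completes even when one step fails. The function and model names are assumptions for illustration, not the actual orchestration code.

```python
import time

def run_with_fallback(step, models, retries=2, delay=0.0):
    """Call step(model) for each model in order, retrying transient failures."""
    last_err = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return step(model)
            except Exception as err:
                last_err = err
                time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"all models failed; last error: {last_err}")
```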

Why a Second Pass Finds Things

Human audits are finite. Edge cases slip through. Code changes. Refactors introduce new paths. A single audit is a snapshot; code is a moving system. Relying on one pass is comfortable, but risky when the attack surface is large.

A second pass does not imply the first audit was poor. It acknowledges that coverage is difficult.

The pipeline is not just another opinion. It works differently: multiple independent exploration paths, an adversarial challenger that questions every finding, conservative deduplication rules, and explicit validation where only issues backed by a working proof-of-concept survive. That is a methodologically different perspective.

Explorers perform structured analysis across protocol design, logic, economics, and attack paths. Findings are merged and deduplicated. A challenger attempts to invalidate them. Proof-of-concepts are built and executed in your framework, such as Foundry or Hardhat. Only findings backed by a working exploit remain in the final report. Everything rejected is logged with a documented reason.
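The staged funnel described above can be sketched as a simple driver: explore independently, merge and dedupe, challenge, then require an executing proof-of-concept. This is a schematic under assumed interfaces, not the real pipeline, which the author notes has many more stages.

```python
def run_pipeline(explorers, dedupe, challenge, build_poc):
    """Staged funnel: explore, merge/dedupe, challenge, prove, report."""
    candidates = [f for explore in explorers for f in explore()]  # parallel in practice
    merged = dedupe(candidates)                                   # conservative dedup
    report, rejected = [], []
    for finding in merged:
        if not challenge(finding):                 # challenger invalidated it
            rejected.append((finding, "no valid attack path"))
            continue
        if not build_poc(finding):                 # PoC must execute successfully
            rejected.append((finding, "not reproducible"))
            continue
        report.append(finding)
    return report, rejected                        # rejections keep their reasons
```

Note that every rejected finding leaves the funnel with a logged reason, mirroring the documented-rejection rule from the quality gate.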

The real pipeline includes significantly more stages than this summary suggests. Behind each stage are validation and control mechanisms. It is staged quality assurance, not a linear AI workflow. A full run spans multiple structured analysis and validation phases, emphasizing depth rather than runtime.

The Numbers

Dozens of candidates per run are systematically reduced to a small, defensible set. Only findings backed by runnable proof-of-concepts survive. That is what it takes to get to proof instead of guesswork.

If it is in the report, there is a working proof-of-concept demonstrating it. That is the bar.

When It Fits

Pre-launch as an additional pass. After a refactor or upgrade. Following governance or tokenomics changes. Before a raise or external audit. Narrow scopes such as a single module or integration.

The same fixed categories are applied every run. I do not publish the list. What ultimately appears in the report depends on the specific attack surface of your codebase.

What Kind of Hardening Shows Up

Across dozens of runs, certain patterns consistently surface. Governance parameters that can be zero or invalid on production paths. Withdrawal and queue logic that cannot progress under loss scenarios. First-depositor and reward front-running in share-based systems. Withdrawal flows that stall when one market becomes illiquid.

These are recurring classes of issues that appear under structured analysis and adversarial challenge. If your system has similar surface area, that is the kind of coverage this pipeline is designed to stress.

If a reproducible, exploit-backed second pass makes sense for your codebase, you know where to find me. Telegram: @Kurt0x


Proof, Not Guesswork: An AI Audit Pipeline That Finds What Other Web3 Audits Miss was originally published in Coinmonks on Medium.

