Ant Group Engineer Reverse-Engineers Claude Code Source, Exposing Auto Mode's Four-Layer Decision Pipeline and Security Classifier

CoinNetwork · 2026-03-25T13:47:51+00:00

Chen Cheng, an engineer at Ant Group, reverse-engineered the source code of Claude Code 2.1.81 and found that every tool invocation in auto mode passes through a four-layer decision pipeline. An independent AI classifier is invoked for security review only when the first three layers cannot reach a decision. The classifier focuses on security risk, covers 22 categories of interception rules, and comes with a circuit breaker and frequency controls on injected behavioral prompts.


According to CoinWorld.net, citing monitoring by 1M AI News, Chen Cheng, an Ant Group engineer and the creator of the Umi.js front-end framework, reverse-engineered the source code of Claude Code 2.1.81 and fully reconstructed what happens once auto mode is switched on. The key finding: every tool invocation passes through a four-layer decision pipeline, and an independent AI classifier is called for security review only when the first three layers cannot establish that the call is safe. The first layer checks existing permission rules and allows the call directly on a match. The second layer simulates acceptEdits mode (which permits file edits); a call that would pass in that mode is considered low risk and skips the classifier. The third layer checks a whitelist of read-only tools (such as Read, Grep, Glob, LSP, and WebSearch), which do not modify any state and are allowed unconditionally. Only when none of these layers applies does the call reach the fourth layer, which sends an API request to Claude Sonnet for security classification.
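The layered short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the decompiled code: the `ToolCall` type, the `decide` function, and the three callback parameters are hypothetical names introduced here, and only the read-only tool names come from the article.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Read-only tools named in the article; they do not modify state.
READ_ONLY_TOOLS = {"Read", "Grep", "Glob", "LSP", "WebSearch"}


@dataclass
class ToolCall:
    # Hypothetical representation of one tool invocation.
    name: str
    argument: str = ""


def decide(
    call: ToolCall,
    matches_permission_rule: Callable[[ToolCall], bool],
    passes_accept_edits: Callable[[ToolCall], bool],
    classify: Callable[[ToolCall], bool],
) -> Tuple[str, str]:
    """Return (decision, deciding_layer); all callbacks are assumptions."""
    # Layer 1: an existing permission rule matches, so allow directly.
    if matches_permission_rule(call):
        return ("allow", "permission_rule")
    # Layer 2: the call would pass under a simulated acceptEdits mode.
    if passes_accept_edits(call):
        return ("allow", "accept_edits")
    # Layer 3: whitelisted read-only tools are allowed unconditionally.
    if call.name in READ_ONLY_TOOLS:
        return ("allow", "read_only")
    # Layer 4: only now is the (slow, costly) classifier consulted.
    return ("allow" if classify(call) else "deny", "classifier")


def never(call: ToolCall) -> bool:
    return False


print(decide(ToolCall("Grep", "TODO"), never, never, never))
# ('allow', 'read_only')
```

Ordering the cheap local checks first means the classifier request, the only step with network latency and token cost, is made only for calls that nothing else could clear.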

Key design details of the classifier: it always uses Sonnet rather than Opus, to keep cost and latency in check; temperature is set to 0 to make its output as deterministic as possible; its role is defined as “a safety monitor for autonomous AI programming agents,” guarding against three types of risk (prompt injection, scope creep, and unintended harm); and the user’s CLAUDE.md configuration file is injected into the classifier’s context as a basis for understanding user intent.
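To make these settings concrete, here is a rough sketch of how such a classifier request might be assembled. The payload shape, the field names, and the `build_classifier_request` helper are all assumptions made for illustration; they are not taken from the Claude Code source or from any real API. Only the quoted role description, the three risk types, the temperature of 0, and the use of CLAUDE.md come from the article.

```python
# Role description quoted in the article; the surrounding wording is illustrative.
SYSTEM_ROLE = "a safety monitor for autonomous AI programming agents"
RISKS = ("prompt injection", "scope creep", "unintended harm")


def build_classifier_request(tool_name: str, tool_input: str, claude_md: str) -> dict:
    """Assemble a hypothetical request payload; field names are assumptions."""
    system = (
        f"You are {SYSTEM_ROLE}. "
        f"Guard against: {', '.join(RISKS)}.\n\n"
        # The user's CLAUDE.md is included so the classifier can judge intent.
        f"User configuration (CLAUDE.md):\n{claude_md}"
    )
    return {
        "model": "sonnet",   # placeholder label, not a real model identifier
        "temperature": 0,    # per the article, for deterministic output
        "system": system,
        "input": f"Tool: {tool_name}\nInput: {tool_input}",
    }


request = build_classifier_request("Shell", "git push --force", "Never force push.")
print(request["temperature"])
```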

The interception rules cover more than 22 categories, including force pushes, direct pushes to the main branch, downloading and executing external code, production deployments, data leaks, self-modification of permissions, creating remote code execution surfaces, and credential leaks. Seven kinds of exceptions are explicitly allowed: hardcoded keys used for testing, local file operations within the working directory, read-only GET requests, installing declared dependencies, installing official toolchains, reading credentials from configuration and sending them to their intended provider, and pushing to the current branch.

There is also a circuit breaker: after three consecutive rejections, or 20 rejections in total, the system falls back to manual confirmation; in headless mode, it aborts the agent outright. When the classifier is unavailable, a feature flag controls whether the system fails closed (rejects immediately) or fails open (falls back to manual confirmation).
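A minimal sketch of this circuit breaker, under stated assumptions: the thresholds of 3 and 20 come from the article, while the `DenialBreaker` class, the helper functions, and the choice to reset the consecutive count on an approval are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DenialBreaker:
    """Trips after 3 consecutive or 20 total rejections (thresholds from the article)."""
    max_consecutive: int = 3
    max_total: int = 20
    consecutive: int = 0
    total: int = 0

    def record(self, allowed: bool) -> None:
        if allowed:
            # Assumption: an approval resets only the consecutive streak.
            self.consecutive = 0
        else:
            self.consecutive += 1
            self.total += 1

    def tripped(self) -> bool:
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)


def on_trip(headless: bool) -> str:
    # Interactive sessions fall back to manual confirmation; headless runs abort.
    return "abort" if headless else "ask_user"


def on_classifier_unavailable(fail_closed: bool) -> str:
    # A feature flag picks between rejecting outright and asking the user.
    return "deny" if fail_closed else "ask_user"


breaker = DenialBreaker()
for allowed in (False, False, True, False, False, False):
    breaker.record(allowed)
print(breaker.tripped(), on_trip(headless=False))
# True ask_user
```

Tracking both counters covers two different failure patterns: a short burst of rejections and a long session in which rejections keep accumulating between approvals.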

Auto mode also tightly controls how often its own behavioral prompt is injected: once every five dialogue rounds. Within each cycle of five injections, the first is the full version (about 800 words, including six instructions such as “execute immediately,” “reduce interruptions,” and “action over plan”) and the remaining four are a concise one-line version, balancing context window usage against behavioral stability.
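One way to read this schedule is sketched below. The reading is an assumption: a reminder is injected on every fifth dialogue round, and injections then repeat in groups of five, one full version followed by four short ones. The function name, the 0-based round index, and the choice to inject on round 0 are all hypothetical.

```python
from typing import Optional


def reminder_for_round(round_index: int, every: int = 5, cycle: int = 5) -> Optional[str]:
    """Return "full", "short", or None for a 0-based dialogue round (assumed scheme)."""
    if round_index % every != 0:
        return None  # no reminder is injected on this round
    injection_index = round_index // every
    # First injection of each cycle is the full prompt; the rest are one-liners.
    return "full" if injection_index % cycle == 0 else "short"


print([reminder_for_round(r) for r in range(0, 30, 5)])
# ['full', 'short', 'short', 'short', 'short', 'full']
```

Under this reading the full prompt recurs only once every 25 rounds, which keeps its context cost low while the one-line versions reinforce the behavior in between.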
