Every scan request passes through a multi-layer detection pipeline. Each layer uses a different technique to catch a different class of attack. The results are combined through anomaly scoring, and a single decision — allow, flag, or block — is returned to your application.

Detection pipeline

1. Heuristic engine (Rust, sub-0.3ms p99)

The first layer is a Rust-powered heuristic engine that runs regex, substring (Aho-Corasick), encoding detection, and Unicode obfuscation checks against all 75 heuristic signatures. Because it runs in native code, p99 latency stays under 0.3ms regardless of input length. This layer covers direct injection commands, jailbreak phrases, obfuscated payloads (base64, ROT13, homoglyphs), and PII/credential patterns in output.
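In spirit, this layer boils down to running a table of compiled patterns over the input. A minimal Python sketch (the two patterns below are illustrative stand-ins, not InferenceWall's actual signature set):

```python
import re

# Hypothetical signatures for illustration only -- the real engine ships
# 75 of these, compiled to native Rust matchers.
PATTERNS = {
    # Direct injection command, e.g. "ignore all instructions"
    "direct_injection": re.compile(r"ignore (all|previous) instructions", re.I),
    # Long base64-looking blob that may hide an obfuscated payload
    "base64_blob": re.compile(r"\b[A-Za-z0-9+/]{40,}={0,2}\b"),
}

def heuristic_matches(text: str) -> list[str]:
    """Return the names of every signature whose pattern occurs in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Each name returned here would correspond to one matched signature feeding into anomaly scoring.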
2. ML classifier engine (ONNX — DeBERTa/DistilBERT)

Available in the Standard and Full profiles, the ML classifier loads quantized DeBERTa and DistilBERT models via ONNX Runtime. It runs 11 classifier signatures against content that the heuristic layer did not catch with high confidence — things like few-shot poisoning, translation bypass, and tool response injection. The classifier produces a confidence score (0.0–1.0) for each match, which feeds directly into anomaly scoring.
3. Semantic similarity engine (FAISS + MiniLM)

The semantic engine uses FAISS vector search with MiniLM embeddings to catch paraphrased attacks — inputs that carry the same adversarial intent as known attacks but use different wording. It runs 10 semantic signatures covering paraphrased instruction overrides, role hijacking, social engineering, and indirect injection via tool output. This layer is what catches attacks that have been deliberately reworded to evade pattern-based detection.
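Conceptually, the semantic layer embeds the input and looks for a known-attack vector within a similarity threshold. A brute-force sketch (a stand-in for the FAISS index; the 0.8 threshold and signature names are assumptions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_signature(query: list[float],
                      signatures: dict[str, list[float]],
                      threshold: float = 0.8):
    """Return (name, similarity) of the closest signature embedding,
    or (None, best) if nothing clears the threshold."""
    name, best = max(((n, cosine(query, v)) for n, v in signatures.items()),
                     key=lambda pair: pair[1])
    return (name, best) if best >= threshold else (None, best)
```

A FAISS index replaces the linear scan with an approximate nearest-neighbor lookup, but the decision — closest known attack above a similarity threshold — is the same.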
4. LLM-judge (Phi-4 Mini Q4)

Available in the Full profile, the LLM-judge runs only when the accumulated score is in a borderline range — high enough to warrant scrutiny but below the early-exit threshold. It uses a quantized Phi-4 Mini model to reason about whether the input is genuinely adversarial, reducing false positives on ambiguous inputs. The judge runs composite signatures for multi-turn escalation, RAG poisoning, payload splitting, and language-switch bypass.
If the score reaches the early_exit threshold (default 13.0) after any layer, the pipeline skips all downstream engines. This prevents wasting compute on inputs that are already clearly malicious.
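The early-exit control flow can be sketched as a loop over the engines (the engine signature here — a callable taking the text and the running score — is a simplification for illustration):

```python
EARLY_EXIT = 13.0  # default early_exit threshold from the docs

def run_pipeline(text: str, engines: list) -> float:
    """Run detection engines in order; once the accumulated score reaches
    the early-exit threshold, skip all downstream engines."""
    score = 0.0
    for engine in engines:
        score = engine(text, score)  # each engine returns the updated score
        if score >= EARLY_EXIT:
            break  # input is already clearly malicious; stop here
    return score
```

This is why a blatant injection never pays the latency cost of the ML classifier, semantic search, or LLM-judge.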

Anomaly scoring

Each match produces a score using the formula:
score = confidence (0.0–1.0) × severity (1–15)
The effective scan score is not a simple sum of all match scores. Instead, InferenceWall uses a max-primary + diminishing corroboration approach — similar to OWASP CRS — where the highest-scoring match anchors the total and each additional match contributes a diminishing increment. This prevents low-severity matches from inflating scores into block territory.
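A sketch of both formulas — the per-match score and the max-primary aggregation. The decay factor of 0.5 is an assumed illustration; the source only specifies that corroborating matches contribute diminishing increments:

```python
def match_score(confidence: float, severity: int) -> float:
    """Per-match score: confidence (0.0-1.0) x severity (1-15)."""
    return confidence * severity

def effective_score(match_scores: list[float], decay: float = 0.5) -> float:
    """Max-primary + diminishing corroboration: the highest match anchors
    the total; each further match (in descending order) is scaled by
    decay, decay**2, and so on, preventing many low-severity matches
    from inflating the score into block territory."""
    if not match_scores:
        return 0.0
    ordered = sorted(match_scores, reverse=True)
    total = ordered[0]
    weight = decay
    for s in ordered[1:]:
        total += s * weight
        weight *= decay
    return total
```

With this aggregation, three matches scoring 10.0, 4.0, and 2.0 yield 10.0 + 2.0 + 0.5 = 12.5, rather than the plain sum of 16.0.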

Decision thresholds

The effective score is compared against per-direction thresholds to produce the final decision:
Direction                allow          flag                  block
Inbound (user → LLM)     score < 4.0    4.0 ≤ score < 10.0    score ≥ 10.0
Outbound (LLM → user)    score < 3.0    3.0 ≤ score < 7.0     score ≥ 7.0
Outbound thresholds are lower because data leakage — exposed PII, API keys, or credentials — is typically more damaging than a failed injection attempt.
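The threshold table maps directly to a small decision function (threshold values taken from the table above):

```python
# Per-direction thresholds from the decision table.
THRESHOLDS = {
    "inbound":  {"flag": 4.0, "block": 10.0},  # user -> LLM
    "outbound": {"flag": 3.0, "block": 7.0},   # LLM -> user
}

def decide(score: float, direction: str) -> str:
    """Map an effective score to the final decision for one direction."""
    t = THRESHOLDS[direction]
    if score >= t["block"]:
        return "block"
    if score >= t["flag"]:
        return "flag"
    return "allow"
```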

ScanResponse fields

Both inferwall.scan_input() and inferwall.scan_output() return a ScanResponse object:
Field        Type    Description
decision     str     "allow", "flag", or "block"
score        float   Effective anomaly score for this scan
matches      list    Each matched signature with its individual score, confidence, severity, and matched text
request_id   str     Unique ID for this scan request (e.g. "req-1712345678000")
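The shape of the response can be sketched as dataclasses. The top-level field names come from the table above; the per-match field names are inferred from its description and may differ from the real SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Match:
    # Per-signature fields inferred from the docs; real names may differ.
    signature: str
    score: float
    confidence: float
    severity: int
    matched_text: str

@dataclass
class ScanResponse:
    decision: str                          # "allow", "flag", or "block"
    score: float                           # effective anomaly score
    matches: list[Match] = field(default_factory=list)
    request_id: str = ""                   # e.g. "req-1712345678000"

def may_proceed(resp: ScanResponse) -> bool:
    """Typical enforcement: forward unless the scan decision is 'block'."""
    return resp.decision != "block"
```

A typical caller checks `decision` first and logs `request_id` and `matches` for audit when the content is flagged or blocked.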

MITRE ATLAS alignment

InferenceWall implements three MITRE ATLAS mitigations across its pipeline:
  • AML.M0015 Adversarial Input Detection — the heuristic and semantic layers detect and block atypical queries before they reach your LLM.
  • AML.M0020 Generative AI Guardrails — the classifier and LLM-judge act as safety filters between the model and the user, applied to both input and output.
  • AML.M0006 Ensemble Methods — the four-engine architecture implements ensemble detection, combining independent techniques so that evading one layer does not defeat the system.
All 100 signatures are individually mapped to ATLAS technique IDs. See Detection Signatures for the full technique coverage table.
Next steps