allow, flag, or block — is returned to your application.
## Detection pipeline
### Heuristic engine (Rust, sub-0.3ms p99)
The first layer is a Rust-powered heuristic engine that runs regex, substring (Aho-Corasick), encoding detection, and Unicode obfuscation checks against all 75 heuristic signatures. Because it runs in native code, p99 latency stays under 0.3ms regardless of input length.

This layer covers direct injection commands, jailbreak phrases, obfuscated payloads (base64, ROT13, homoglyphs), and PII/credential patterns in output.
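As a rough sketch of how such a layer works, the snippet below combines a multi-pattern scan with base64 decoding of suspicious runs. The patterns, check names, and thresholds here are illustrative assumptions, not InferenceWall's actual signatures:

```python
import base64
import re

# Illustrative patterns only -- not InferenceWall's real signature set.
INJECTION_PATTERNS = re.compile(
    r"ignore (all )?previous instructions|disregard your system prompt",
    re.IGNORECASE,
)
# Long base64-looking runs that may hide an encoded payload.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def heuristic_scan(text: str) -> list[str]:
    """Return the names of heuristic checks that fired on `text`."""
    hits = []
    if INJECTION_PATTERNS.search(text):
        hits.append("direct_injection")
    for run in B64_RUN.findall(text):
        try:
            decoded = base64.b64decode(run).decode("utf-8", "ignore")
        except Exception:
            continue  # not valid base64; ignore the run
        if INJECTION_PATTERNS.search(decoded):
            hits.append("base64_obfuscated_injection")
    return hits
```

A production engine compiles all patterns into a single Aho-Corasick automaton so the cost stays linear in input length, which is what keeps native-code p99 latency low.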
### ML classifier engine (ONNX — DeBERTa/DistilBERT)
Available in the Standard and Full profiles, the ML classifier loads quantized DeBERTa and DistilBERT models via ONNX Runtime. It runs 11 classifier signatures against content that the heuristic layer did not catch with high confidence — things like few-shot poisoning, translation bypass, and tool response injection.

The classifier produces a confidence score (0.0–1.0) for each match, which feeds directly into anomaly scoring.
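Conceptually, the classifier's logits are turned into a confidence via softmax, and that confidence is folded into the anomaly score. The two-class layout (index 0 = benign, index 1 = attack), the 0.5 gate, and the weighting below are assumptions for illustration, not InferenceWall's actual formula:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attack_confidence(logits: list[float]) -> float:
    """Probability mass on the assumed 'attack' class (index 1)."""
    return softmax(logits)[1]

def classifier_score(logits: list[float], base_weight: float = 3.0) -> float:
    """Fold classifier confidence into an anomaly-score contribution.
    Gate and weight are illustrative assumptions."""
    conf = attack_confidence(logits)
    return base_weight * conf if conf >= 0.5 else 0.0
```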
### Semantic similarity engine (FAISS + MiniLM)
The semantic engine uses FAISS vector search with MiniLM embeddings to catch paraphrased attacks — inputs that carry the same adversarial intent as known attacks but use different wording. It runs 10 semantic signatures covering paraphrased instruction overrides, role hijacking, social engineering, and indirect injection via tool output.

This layer is what catches attacks that have been deliberately reworded to evade pattern-based detection.
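The core operation is a nearest-neighbor search over embeddings of known attacks. The brute-force NumPy sketch below is equivalent to a FAISS inner-product search over normalized vectors; the embeddings here are placeholders, whereas the real engine embeds text with MiniLM:

```python
import numpy as np

def best_semantic_match(query: np.ndarray, known_attacks: np.ndarray):
    """Cosine similarity of a query embedding against rows of known-attack
    embeddings; returns (row_index, similarity) of the closest match."""
    q = query / np.linalg.norm(query)
    idx = known_attacks / np.linalg.norm(known_attacks, axis=1, keepdims=True)
    sims = idx @ q                 # inner product of unit vectors = cosine
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

A match whose similarity exceeds a signature's threshold fires even when the wording shares no tokens with the original attack, which is exactly the paraphrase case pattern matching misses.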
### LLM-judge (Phi-4 Mini Q4)
Available in the Full profile, the LLM-judge runs only when the accumulated score is in a borderline range — high enough to warrant scrutiny but below the early-exit threshold. It uses a quantized Phi-4 Mini model to reason about whether the input is genuinely adversarial, reducing false positives on ambiguous inputs.

The judge runs composite signatures for multi-turn escalation, RAG poisoning, payload splitting, and language-switch bypass.
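The gating logic is a simple band check. The 13.0 early-exit default is documented below; the lower bound of the borderline band is an assumption for illustration:

```python
EARLY_EXIT = 13.0   # documented early_exit default
JUDGE_LOWER = 6.0   # lower bound of the borderline band: an assumption

def should_invoke_judge(accumulated_score: float) -> bool:
    """The judge runs only on borderline scores: high enough to warrant
    scrutiny, but below the early-exit threshold, where the verdict is
    already clear and the extra inference would be wasted."""
    return JUDGE_LOWER <= accumulated_score < EARLY_EXIT
```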
If the score reaches the `early_exit` threshold (default 13.0) after any layer, the pipeline skips all downstream engines. This prevents wasting compute on inputs that are already clearly malicious.

## Anomaly scoring
Each match produces a score using the formula:

## Decision thresholds
The effective score is compared against per-direction thresholds to produce the final decision:

| Direction | allow | flag | block |
|---|---|---|---|
| Inbound (user → LLM) | score < 4.0 | 4.0 ≤ score < 10.0 | score ≥ 10.0 |
| Outbound (LLM → user) | score < 3.0 | 3.0 ≤ score < 7.0 | score ≥ 7.0 |
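The table maps directly onto a pair of threshold comparisons; a minimal sketch using the documented values:

```python
# Per-direction thresholds from the decision table above.
THRESHOLDS = {
    "inbound":  {"flag": 4.0, "block": 10.0},   # user -> LLM
    "outbound": {"flag": 3.0, "block": 7.0},    # LLM -> user
}

def decide(score: float, direction: str) -> str:
    """Map an effective anomaly score to "allow", "flag", or "block"."""
    t = THRESHOLDS[direction]
    if score >= t["block"]:
        return "block"
    if score >= t["flag"]:
        return "flag"
    return "allow"
```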
## ScanResponse fields
Both `inferwall.scan_input()` and `inferwall.scan_output()` return a `ScanResponse` object:
| Field | Type | Description |
|---|---|---|
| `decision` | str | `"allow"`, `"flag"`, or `"block"` |
| `score` | float | Effective anomaly score for this scan |
| `matches` | list | Each matched signature with its individual score, confidence, severity, and matched text |
| `request_id` | str | Unique ID for this scan request (e.g. `"req-1712345678000"`) |
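A typical consumer branches on `decision`. The dataclass below is a stand-in mirroring the documented fields so the example is self-contained; in practice you would use the object returned by `inferwall.scan_input()` directly:

```python
from dataclasses import dataclass, field

@dataclass
class ScanResponse:
    """Stand-in with the documented fields of the real ScanResponse."""
    decision: str               # "allow", "flag", or "block"
    score: float
    matches: list = field(default_factory=list)
    request_id: str = ""

def handle_scan(resp: ScanResponse) -> bool:
    """Return True if the request may proceed to the LLM."""
    if resp.decision == "block":
        return False
    if resp.decision == "flag":
        # e.g. log resp.request_id and resp.matches for later review,
        # then let the request through under closer monitoring
        pass
    return True
```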
## MITRE ATLAS alignment
InferenceWall implements three MITRE ATLAS mitigations across its pipeline:

- AML.M0015 Adversarial Input Detection — the heuristic and semantic layers detect and block atypical queries before they reach your LLM.
- AML.M0020 Generative AI Guardrails — the classifier and LLM-judge act as safety filters between the model and the user, applied to both input and output.
- AML.M0006 Ensemble Methods — the four-engine architecture implements ensemble detection, combining independent techniques so that evading one layer does not defeat the system.
## Next steps
- Anomaly Scoring and Decisions — understand exactly how scores are calculated and thresholds are applied.
- Detection Signatures — explore the 100 built-in signatures and their MITRE ATLAS mappings.