Detection Signatures in InferenceWall

InferenceWall ships with 100 detection signatures across five categories: prompt injection, content safety, data leakage, system prompt, and agentic threats. Every signature is mapped to one or more MITRE ATLAS technique IDs, so you can assess coverage against the adversarial AI threat taxonomy directly.

Signature categories

Category	ID prefix	Count	What it detects
Prompt Injection	`INJ`	67	Direct injection (30, of which 20 are jailbreaks), indirect injection (10), obfuscation (18), semantic paraphrasing (10)
Data Leakage	`DL`	14	PII in output — `DL-P-` (8 sigs); secrets and credentials — `DL-S-` (6 sigs)
Content Safety	`CS`	9	Toxicity — `CS-T-` (7 sigs); bias — `CS-B-` (2 sigs)
System Prompt	`SP`	4	Prompt leak in output (2 sigs), training data extraction (1 sig), model probing (1 sig)
Agentic	`AG`	6	Tool abuse (2 sigs), privilege escalation (1 sig), host escape (2 sigs), exfiltration via agent (1 sig)

Counts in the Prompt Injection category overlap: jailbreak signatures (INJ-D-001, INJ-D-006, INJ-D-010 through INJ-D-029) are a subset of the 30 direct injection signatures. The 20 jailbreak signatures cover role-play personas, DAN variants, named jailbreak personas, debug/developer mode activation, and amoral bot framing.

MITRE ATLAS technique coverage

ATLAS technique	Name	Signatures
AML.T0051.000	LLM Prompt Injection: Direct	30
AML.T0051.001	LLM Prompt Injection: Indirect	10
AML.T0054	LLM Jailbreak	20
AML.T0068	LLM Prompt Obfuscation	18
AML.T0056	LLM Meta Prompt Extraction	6
AML.T0065	LLM Prompt Crafting	10
AML.T0057	LLM Data Leakage	16
AML.T0055	Unsecured Credentials	6
AML.T0048.002	External Harms: Societal	3
AML.T0048.003	External Harms: User Harm	6
AML.T0024	Exfiltration via AI Inference API	1
AML.T0069	Discover LLM System Information	1
AML.T0053	AI Agent Tool Invocation	3
AML.T0080	AI Agent Context Poisoning	1
AML.T0105	Escape to Host	2
AML.T0086	Exfiltration via AI Agent Tool	1

Many signatures map to multiple techniques. The counts above reflect primary technique mappings. Coverage is based on MITRE ATLAS v5.5 (March 2026).
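
As a rough illustration of using the table to check coverage against a list of techniques you care about, the sketch below copies the counts into a dictionary. The function name `coverage_gaps` and the placeholder ID `AML.T9999` are hypothetical; neither is part of InferenceWall or ATLAS.

```python
# Signature counts per ATLAS technique, copied from the table above.
ATLAS_COVERAGE = {
    "AML.T0051.000": 30,
    "AML.T0051.001": 10,
    "AML.T0054": 20,
    "AML.T0068": 18,
    "AML.T0056": 6,
    "AML.T0065": 10,
    "AML.T0057": 16,
    "AML.T0055": 6,
    "AML.T0048.002": 3,
    "AML.T0048.003": 6,
    "AML.T0024": 1,
    "AML.T0069": 1,
    "AML.T0053": 3,
    "AML.T0080": 1,
    "AML.T0105": 2,
    "AML.T0086": 1,
}


def coverage_gaps(required):
    """Return the technique IDs in `required` that have no shipped signature."""
    return [tid for tid in required if ATLAS_COVERAGE.get(tid, 0) == 0]


if __name__ == "__main__":
    # "AML.T9999" is a made-up placeholder standing in for an uncovered technique.
    wanted = ["AML.T0054", "AML.T0105", "AML.T9999"]
    print(coverage_gaps(wanted))
```

A check like this only tells you whether a technique has at least one signature; it says nothing about how well those signatures detect it.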

Signature ID format

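Signature IDs follow the pattern {CATEGORY}-{SUBCATEGORY}-{NUMBER}. The table below lists the prefix and subcategory codes for each category; System Prompt and Agentic define no subcategories: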

Category	Prefix	Subcategories
Prompt Injection	`INJ`	`D` (direct), `I` (indirect), `O` (obfuscation), `S` (semantic)
Content Safety	`CS`	`T` (toxicity), `B` (bias)
Data Leakage	`DL`	`P` (PII), `S` (secrets/credentials)
System Prompt	`SP`	— (no subcategory)
Agentic	`AG`	— (no subcategory)

For example, INJ-D-002 is the second direct prompt injection signature, and DL-S-001 is the first secrets/credentials data leakage signature.
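
The ID format can be decoded mechanically. The following is a minimal sketch, not InferenceWall's own parser: `parse_signature_id` and `SignatureId` are hypothetical names, and it assumes that `SP` and `AG` IDs simply omit the subcategory segment (for example `SP-001`), which this page does not confirm.

```python
import re
from typing import NamedTuple, Optional

# Prefixes and subcategory codes, copied from the table above.
CATEGORIES = {
    "INJ": "Prompt Injection",
    "CS": "Content Safety",
    "DL": "Data Leakage",
    "SP": "System Prompt",
    "AG": "Agentic",
}
SUBCATEGORIES = {
    "INJ": {"D": "direct", "I": "indirect", "O": "obfuscation", "S": "semantic"},
    "CS": {"T": "toxicity", "B": "bias"},
    "DL": {"P": "PII", "S": "secrets/credentials"},
    "SP": {},  # no subcategory
    "AG": {},  # no subcategory
}


class SignatureId(NamedTuple):
    category: str
    subcategory: Optional[str]
    number: int


# Assumption: the subcategory segment is a single letter and is optional.
_ID_RE = re.compile(r"^([A-Z]+)-(?:([A-Z])-)?(\d+)$")


def parse_signature_id(sig_id: str) -> SignatureId:
    """Split a signature ID into category, subcategory, and number."""
    m = _ID_RE.match(sig_id)
    if m is None:
        raise ValueError(f"malformed signature ID: {sig_id!r}")
    prefix, sub, num = m.groups()
    if prefix not in CATEGORIES:
        raise ValueError(f"unknown category prefix: {prefix!r}")
    allowed = SUBCATEGORIES[prefix]
    if sub is None:
        if allowed:
            raise ValueError(f"{prefix} IDs need a subcategory: {sig_id!r}")
        sub_name = None
    else:
        if sub not in allowed:
            raise ValueError(f"unknown subcategory {sub!r} for {prefix}")
        sub_name = allowed[sub]
    return SignatureId(CATEGORIES[prefix], sub_name, int(num))


if __name__ == "__main__":
    print(parse_signature_id("INJ-D-002"))
    print(parse_signature_id("DL-S-001"))
```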

Match object

When a signature fires, InferenceWall returns a match object in the matches list of the ScanResponse:

{
  "signature_id": "INJ-D-002",
  "matched_text": "ignore all previous instructions",
  "score": 6.3,
  "confidence": 0.9,
  "severity": 7.0
}

Field	Description
`signature_id`	The ID of the matched signature
`matched_text`	The portion of the input that triggered the match
`score`	Effective score for this match (`confidence × severity`)
`confidence`	Engine confidence (0.0–1.0)
`severity`	Signature severity weight (1–15)
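
To make the relationship between the fields concrete, this sketch parses the example match object above and recomputes `score` from `confidence` and `severity`. The helper `effective_score` is a hypothetical name used for illustration; only the field names and ranges come from this page.

```python
import json
import math

# The example match object from this page.
RAW_MATCH = """
{
  "signature_id": "INJ-D-002",
  "matched_text": "ignore all previous instructions",
  "score": 6.3,
  "confidence": 0.9,
  "severity": 7.0
}
"""


def effective_score(confidence: float, severity: float) -> float:
    """Recompute the effective score as confidence x severity."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in the range 0.0-1.0")
    if not 1 <= severity <= 15:
        raise ValueError("severity must be in the range 1-15")
    return confidence * severity


match = json.loads(RAW_MATCH)
recomputed = effective_score(match["confidence"], match["severity"])

# Compare with a tolerance rather than ==, since these are floats.
print(math.isclose(match["score"], recomputed))
```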

Detection engines by category

Signatures run on the engine that matches their detection technique:

Engine	Profiles	Signature types
Heuristic (Rust)	Lite, Standard, Full	`regex`, `substring`, `encoding`, `unicode` patterns
ML classifier (ONNX)	Standard, Full	`classifier` — DeBERTa/DistilBERT models
Semantic (FAISS + MiniLM)	Standard, Full	`semantic` — embedding similarity
LLM-judge (Phi-4 Mini Q4)	Full	`composite` — borderline and multi-step cases
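
The table can also be read the other way round, as the set of signature types that run under a given profile. The sketch below encodes the table as data; `ENGINES` and `signature_types_for` are hypothetical names, not InferenceWall configuration or API.

```python
# (engine, profiles that include it, signature types it runs), from the table above.
ENGINES = [
    ("Heuristic (Rust)", {"Lite", "Standard", "Full"},
     ["regex", "substring", "encoding", "unicode"]),
    ("ML classifier (ONNX)", {"Standard", "Full"}, ["classifier"]),
    ("Semantic (FAISS + MiniLM)", {"Standard", "Full"}, ["semantic"]),
    ("LLM-judge (Phi-4 Mini Q4)", {"Full"}, ["composite"]),
]


def signature_types_for(profile: str) -> list:
    """List the signature types that can run under the given profile."""
    types = []
    for _engine, profiles, sig_types in ENGINES:
        if profile in profiles:
            types.extend(sig_types)
    return types


if __name__ == "__main__":
    for profile in ("Lite", "Standard", "Full"):
        print(profile, signature_types_for(profile))
```

Read this way, `composite` signatures run only under the Full profile, and the Lite profile runs only the heuristic pattern types.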

To add your own signatures or override shipped ones, see Custom Signatures.
