Skip to main content
InferenceWall does not make binary allow/block decisions based on individual signature matches. Instead, every match contributes to an anomaly score, and the final decision is based on whether that score crosses a threshold. This lets you tune sensitivity — raising thresholds reduces noise, lowering them increases coverage — without modifying signatures.

Score formula

Each signature match produces a score:
match score = confidence × severity
  • confidence — a float between 0.0 and 1.0 representing how certain the detection engine is about the match. Heuristic matches from the Rust engine typically have confidence 0.91.0. ML classifier and semantic matches vary.
  • severity — an integer from 1 to 15 set by the signature author to reflect how dangerous the detected pattern is. For example, DL-S-004 (Private Keys) has severity 15; DL-P-007 (IP Addresses) has severity 3.

Corroboration (diminishing returns)

The effective scan score is not a simple sum of all match scores. InferenceWall uses a max-primary + diminishing corroboration approach, similar to OWASP CRS:
  1. The highest individual match score becomes the primary score.
  2. Each additional match contributes a diminishing increment to the total.
  3. Multiple weak signals can push a borderline score over a threshold, but they cannot dominate over a single strong signal.
This prevents low-severity signatures from stacking into a block decision and makes the score reflect the most severe threat present.

Early exit

If the accumulated score reaches or exceeds the early_exit threshold (default 13.0) after any engine layer, the pipeline stops and returns immediately. Downstream engines are skipped.
Early exit improves performance on clearly malicious inputs. In most cases, high-severity injection or credential leakage signatures fire in the heuristic layer and trigger early exit before the ML or semantic engines run.

Decision thresholds

The effective score is compared against direction-specific thresholds:
Directionallowflagblock
Inbound (user input)score < 4.0score ≥ 4.0score ≥ 10.0
Outbound (LLM output)score < 3.0score ≥ 3.0score ≥ 7.0
Outbound thresholds are lower because output scanning catches data leakage — exposed credentials, PII, or system prompt contents — where the cost of a false negative (letting it through) is higher than the cost of a false positive (flagging clean output).

ScanResponse example

import inferwall

result = inferwall.scan_input("Ignore all previous instructions")
result.decision    # "block"
result.score       # 12.0
result.matches     # list of matched signatures with individual scores
result.request_id  # "req-1712345678000"
A flag decision means the content exceeded the flag threshold but not the block threshold. Your application can handle flags differently from blocks — for example, routing flagged requests to a human review queue while blocking high-confidence attacks outright.

Configuring thresholds

All five thresholds are configurable in a policy profile:
ThresholdDefaultStrict
inbound_flag4.02.5
inbound_block10.07.0
outbound_flag3.02.0
outbound_block7.05.0
early_exit13.010.0
Thresholds are set per policy profile, not globally. You can have different thresholds for different environments or use cases. See Policy Profiles for details.
Start with mode: monitor and observe scores in production for one to two weeks before setting thresholds. This gives you a baseline of what your real traffic looks like and helps you avoid over-blocking legitimate inputs.