Status: v0 — LOCKED for 0.2.0. This is the v0 schema and it is frozen for the 0.2.0 release: the fields, pattern kinds, constraints, and flow grammar described here are the contract the 0.2.0 engine implements and will not change within the release. (Pre-1.0.0 the schema may still evolve in a future minor — see CHANGELOG.md — but a detector that validates against this reference works with 0.2.0 as written.) This file is the single source of truth for the spec format; other docs link here rather than restating it.
The parser is implemented and shipping (`scanipy.dsl.parse_spec`): it validates every field, all four pattern kinds (`call`, `attribute`, `parameter`, `import`), and the flow grammar, raising a location-aware `DSLError` on anything outside the DSL. See Validation & errors.
A detector is a declarative YAML file that tells scanipy how to find one class of vulnerability by taint tracking: follow untrusted data from a source, through optional propagators, to a dangerous sink — unless a sanitizer neutralizes it on the way. Detection logic lives entirely in these specs; the engine is class-agnostic (principle P4).
Bundled specs live in src/scanipy/detectors/<class>/<name>.yml and ship as
package data.
```yaml
id: python.injection.os-command     # unique id: <language>.<class>.<name>
name: OS command injection          # short human title
cwe: CWE-78                         # primary CWE
severity: high                      # low | medium | high | critical
languages: [python]                 # languages this spec applies to
message: >                          # shown on every finding; say what + how to fix
  Untrusted input reaches an OS command without sanitization...
metadata:                           # optional, free-form
  owasp: "A03:2021-Injection"
  references:
    - https://cwe.mitre.org/data/definitions/78.html
sources:     [ <pattern>, ... ]     # where taint enters (required, >= 1)
sinks:       [ <pattern>, ... ]     # where taint is dangerous (required, >= 1)
sanitizers:  [ <pattern>, ... ]     # what neutralizes taint (optional)
propagators: [ <propagator>, ... ]  # how taint flows through (optional)
```

| Field | Required | Notes |
|---|---|---|
| `id` | yes | Globally unique. Convention: `<language>.<class>.<name>`. |
| `name` | yes | Short human-readable title. |
| `cwe` | yes | Primary CWE identifier; must match `CWE-<digits>` (e.g. `CWE-78`). |
| `severity` | yes | One of `low`, `medium`, `high`, `critical` (lowercase strings). |
| `languages` | yes | Non-empty list; `python` is the only supported value in v1 (P7). |
| `message` | yes | Explains the flaw and the fix; rendered on every finding. |
| `metadata` | no | Free-form map (`owasp`, `references`, …); document order is preserved. |
| `sources` | yes | One or more patterns. |
| `sinks` | yes | One or more patterns. |
| `sanitizers` | no | Optional; may be `[]`. A missing sanitizer never raises (P5). |
| `propagators` | no | Optional. Defaults to the engine's built-in propagation. |

Any unknown top-level key is rejected. Required keys must be present; the parser reports the first offending key in document order (deterministic, P3).
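
The key rules above can be sketched in a few lines of Python. This is a minimal illustration, not scanipy's parser: the `REQUIRED` and `OPTIONAL` tuples are transcribed from the field table, `first_key_problem` is a hypothetical helper, and checking unknown keys before missing ones is an assumption of the sketch.

```python
from typing import Optional

# Key sets transcribed from the field table; the helper below is hypothetical.
REQUIRED = ("id", "name", "cwe", "severity", "languages", "message", "sources", "sinks")
OPTIONAL = ("metadata", "sanitizers", "propagators")


def first_key_problem(spec: dict) -> Optional[str]:
    """Return a message for the first offending top-level key, or None."""
    allowed = set(REQUIRED) | set(OPTIONAL)
    # dicts preserve insertion order, so this walks keys in document order.
    for key in spec:
        if key not in allowed:
            return f"unknown key '{key}'"
    # Assumption: missing keys are reported in the order the table lists them.
    for key in REQUIRED:
        if key not in spec:
            return f"missing required key '{key}'"
    return None
```

Because the walk is ordered, the same input always yields the same first problem, which is the deterministic behaviour the rule asks for.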
scanipy.dsl.parse_spec validates every field shape, enum, and pattern/flow
grammar, and rejects anything outside the DSL (unknown keys/kinds, bad enums,
malformed patterns or flows, empty sources/sinks). Validation is exhaustive
and declarative (P4): there is no per-detector or per-CWE logic in the parser.
On any problem it raises a `scanipy.dsl.DSLError` (a `ValueError` subclass) whose
`str()` is a single, deterministic line:

```text
path:line:col: [spec_id] field: message
```

for example:

```text
detectors/injection/os-command.yml:29:5: [python.injection.os-command] sinks[1].when: unknown 'when' condition 'argument'; v1 supports: keyword
```

The error also exposes the raw pieces as attributes (.spec_id, .field,
.source_path, .line, .column) for programmatic use. Lines are 1-based and
columns 0-based. Invalid YAML and structural problems (empty document,
non-mapping root, duplicate keys) are reported the same way — a raw yaml
exception never escapes.
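
To make the one-line format concrete, here is a small stand-in exception that carries the documented attributes and renders the same shape. `SketchDSLError` and its constructor signature are hypothetical and exist only for illustration; the real `DSLError` may be built differently.

```python
class SketchDSLError(ValueError):
    """Hypothetical stand-in that mirrors the documented attributes."""

    def __init__(self, message, *, spec_id, field, source_path, line, column):
        super().__init__(message)
        self.spec_id = spec_id
        self.field = field
        self.source_path = source_path
        self.line = line        # 1-based
        self.column = column    # 0-based

    def __str__(self):
        # path:line:col: [spec_id] field: message
        return (
            f"{self.source_path}:{self.line}:{self.column}: "
            f"[{self.spec_id}] {self.field}: {self.args[0]}"
        )


err = SketchDSLError(
    "unknown 'when' condition 'argument'; v1 supports: keyword",
    spec_id="python.injection.os-command",
    field="sinks[1].when",
    source_path="detectors/injection/os-command.yml",
    line=29,
    column=5,
)
```

Here `str(err)` reproduces the example line shown above, while callers that need structure can read `err.field` or `err.line` directly instead of parsing the string.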
A pattern matches a syntactic site. It has a kind, a dotted pattern
string (with * wildcards), and optional constraints.
```yaml
{ kind: call, pattern: "os.system", args: [0] }
{ kind: attribute, pattern: "flask.request.*" }
{ kind: call, pattern: "subprocess.*", when: { keyword: { shell: true } } }
```

| `kind` | Matches | Pattern shape | Status |
|---|---|---|---|
| `call` | a function/method call, e.g. `os.system(...)` | dotted path, `*` wildcards | supported |
| `attribute` | an attribute access, e.g. `flask.request.args` | dotted path, `*` wildcards | supported |
| `parameter` | a function parameter (request-handler args) | a bare name (`request`) or a scoped selector (`handler.request`) | supported |
| `import` | an imported module/name | a module path, optionally ending in `*` (`pickle`, `flask.*`) | supported |

All four kinds share the same dotted-path grammar (see pattern).
The parser validates pattern shape for every kind; the runtime meaning of
parameter/import is resolved by the engine. args and when are accepted
only on kind: call — see the validity matrix below.
The matcher resolves each kind against a different canonical name: a `call`
matches on the callee path, an `attribute` on the attribute chain, an
`import` on the imported canonical name, and a `parameter` on the bare
parameter name (matched with the same wildcard grammar). The shape of all
four kinds is part of the locked v0 schema and is fully validated by the parser.
No bundled 0.2.0 detector uses `parameter`/`import`: the seven shipped detectors
are written with `call`/`attribute` patterns. The richer runtime semantics of
`parameter`/`import` are therefore not relied upon by the shipped catalog in
this release (honest scope, P7); treat them as structural-for-now.
A dotted path with `*` as a wildcard segment:

- `os.system`: exactly `os.system`
- `subprocess.*`: any direct attribute of `subprocess` (`run`, `Popen`, …)
- `*.cursor.execute`: `execute` on any object's `.cursor`

The pattern is matched against the frontend's canonical dotted name for the
site (imports/aliases already resolved). Matching is segment-wise — both the
pattern and the name are split on . and compared segment by segment, never as a
substring or a glob over the raw string. A single * segment is allowed, and its
position picks one of exactly three modes:

| Mode | Where `*` is | `*` consumes | Example pattern | Matches | Does not match |
|---|---|---|---|---|---|
| Exact | no `*` | n/a | `os.system` | `os.system` | `os.popen`; `mymod.os.system` |
| Exact (bare) | no `*` | n/a | `input` | `input` | `mymod.input` |
| Trailing-single | last segment | exactly one segment | `subprocess.*` | `subprocess.run` | `subprocess.run.foo`; bare `subprocess` |
| Trailing-single | last segment | exactly one segment | `flask.request.*` | `flask.request.args` | `flask.request.args.get` |
| Leading-greedy | first segment | one or more segments | `*.execute` | `db.execute`; `self.db.cursor.execute` | `db.executemany`; bare `execute` |
| Leading-greedy | first segment | one or more segments | `*.cursor.execute` | `self.db.cursor.execute` | `self.db.execute` |

- Trailing-single (`pkg.*`) fixes the literal prefix and requires the name to be exactly one segment longer: `*` stands for one attribute, not a subtree.
- Leading-greedy (`*.tail`) fixes the literal suffix and lets `*` swallow one or more leading segments, with the receiver prefix left unconstrained. This is the load-bearing safety net for idiomatic method sinks: `*.execute` fires on `self.db.cursor.execute` (an aliased/deep receiver) without the spec author having to enumerate every receiver shape. It is intentionally greedy, unlike trailing-single, and is matched per segment, so `*.execute` never matches `executemany` (different last segment).
- A lone `*` is the trailing-single case with an empty prefix: it matches any single-segment name and nothing dotted.
- Any other `*` placement is not a valid pattern, and the parser rejects it at load time with a `DSLError`, so a typo'd pattern fails loudly rather than becoming a silently-dead rule (P5/P7). `*` is allowed only once, as a single whole leading segment (`*.suffix`) or a single whole trailing segment (`prefix.*`). Rejected examples: a partial-segment `*` like `os.sys*`, a `*` in a middle segment like `a.*.c` or `os.*.system`, and more than one `*` like `*.*` or `*.a.*`. The matcher additionally never widens: were such a `Pattern` ever constructed directly (bypassing the parser), it is treated as no match (defense-in-depth; the matcher never guesses).
- An unresolved name (the frontend could not canonicalize the callee/target, e.g. `foo()()`) is always a no-match, never an error.
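
The three modes can be sketched as a short segment-wise matcher. This is an illustrative reading of the rules above, not the engine's code: `match_pattern` is a hypothetical name, and an unresolved name is modelled as `None`.

```python
from typing import Optional


def match_pattern(pattern: str, name: Optional[str]) -> bool:
    """Segment-wise match of a dotted pattern against a canonical name."""
    if not name:
        return False  # unresolved name: always a no-match, never an error
    pat = pattern.split(".")
    segs = name.split(".")
    stars = [i for i, seg in enumerate(pat) if "*" in seg]
    if not stars:
        return pat == segs  # exact mode
    if len(stars) != 1 or pat[stars[0]] != "*":
        return False  # several wildcards or a partial-segment '*': never widen
    pos = stars[0]
    if pos == len(pat) - 1:
        # Trailing-single (also covers a lone '*'): exactly one extra segment.
        return len(segs) == len(pat) and segs[:-1] == pat[:-1]
    if pos == 0:
        # Leading-greedy: '*' swallows one or more leading segments.
        tail = pat[1:]
        return len(segs) > len(tail) and segs[-len(tail):] == tail
    return False  # '*' in a middle segment is not a valid placement
```

Checking the trailing position before the leading one is what makes a lone `*` behave as trailing-single with an empty prefix, as described in the list above.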

| Key | Valid on | Meaning |
|---|---|---|
| `args` | `call` only | Restrict to specific positional argument indices, e.g. `args: [0]`. A non-empty list of non-negative integers; the parser sorts and de-duplicates them. Taint in any listed argument triggers the rule. |
| `when` | `call` only | Extra conditions. v0 supports exactly `when: { keyword: { name: value } }`, e.g. require `shell=True`. The name must be a valid identifier and the value must be a scalar. |

args and when are accepted only on kind: call; the parser rejects them
on attribute, parameter, and import with a precise error.

| `kind` | `args` | `when` |
|---|---|---|
| `call` | allowed | allowed |
| `attribute` | rejected | rejected |
| `parameter` | rejected | rejected |
| `import` | rejected | rejected |

Scalar typing is exact: shell: true is a YAML boolean and stays a bool,
while shell: 'true' stays the string "true". The two are kept distinct so the
engine's shell=True check is unambiguous.
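
As a rough sketch of the validity matrix and the `args` shape rule, the check below works on an already-loaded pattern mapping. `constraint_errors` and its message wording are hypothetical; the real parser reports a `DSLError` with location information instead of returning strings.

```python
def constraint_errors(pattern: dict) -> list:
    """Collect constraint problems for one pattern mapping (illustrative only)."""
    errors = []
    kind = pattern.get("kind")
    # 'args' and 'when' are accepted only on kind: call.
    for key in ("args", "when"):
        if key in pattern and kind != "call":
            errors.append(f"'{key}' is only valid on kind: call (got kind: {kind})")
    if kind == "call" and "args" in pattern:
        args = pattern["args"]
        # type(i) is int keeps booleans out: True is not an argument index.
        ok = (
            isinstance(args, list)
            and len(args) > 0
            and all(type(i) is int and i >= 0 for i in args)
        )
        if not ok:
            errors.append("'args' must be a non-empty list of non-negative integers")
    return errors
```

The exact-type check mirrors the scalar typing rule: a boolean and an integer (or a string) are never treated as interchangeable.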

- Indices are 0-based and positional, and they exclude the receiver of a method call. On `*.execute` the receiver is the object before `.execute`; `args: [0]` targets the first written argument (the SQL string), not the receiver. The receiver is addressed as `self` in the flow vocabulary.
- With no `args` key, every written positional argument is in scope.
- With an `args` restriction, the engine checks the intersection of the listed indices with the call's actual written positional arguments. Out-of-range and negative indices are dropped; the surviving set is sorted and de-duplicated.
- If a restriction names only out-of-range indices (e.g. `args: [5]` on a two-argument call), the pattern does not match that site: it cannot carry the targeted taint.
- Known gap (v1): `args` is positional-only. A dangerous value passed by keyword (`subprocess.run(args=cmd, shell=True)`) is not covered by a positional `args` restriction; this is the deferred "kwarg-targeted args" DSL extension. The net effect is a false negative (honest scope, P7); model such sinks with `when` (as the os-command spec does for `subprocess.*`) where possible. `*args` splats are counted as one written positional; over-approximating them is the engine's decision, not the matcher's.
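
The index rules above reduce to a small set intersection. The helper below is a sketch under stated assumptions: `checked_indices` is a hypothetical name, `written_count` stands for the number of positional arguments written at the call site, and `None` is used to signal that the pattern does not match.

```python
from typing import List, Optional, Sequence


def checked_indices(args: Optional[Sequence[int]], written_count: int) -> Optional[List[int]]:
    """Indices to check at a call site, or None when the pattern cannot match."""
    if args is None:
        # No 'args' key: every written positional argument is in scope.
        return list(range(written_count))
    # Drop negative and out-of-range indices, then sort and de-duplicate.
    kept = sorted({i for i in args if 0 <= i < written_count})
    # A restriction that names only out-of-range indices matches nothing.
    return kept or None
```

For a two-argument call, `args: [0]` keeps index 0, while `args: [5]` leaves nothing to check and the site is skipped.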

- v1 supports exactly the `keyword` condition. `when: { keyword: { shell: true } }` matches only a keyword argument passed as a constant literal equal to the expected value. `shell=True` matches only a literal `True`; `shell=False`, an absent `shell`, or `shell=<variable>` (a non-literal) do not match.
- Multiple keyword pairs are ANDed: every pair must hold. Keyword names and `when` keys are evaluated in sorted order for determinism (P3).
- Any unknown top-level `when` key (anything other than `keyword`) is treated as no match, so an unsupported constraint can never silently widen matches. The parser is the real gate and rejects such keys up front.
- Known gap (v1): literal-equality only; `shell=<truthy variable>` is a false negative. Niche, documented (P7).
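
Literal-equality matching can be illustrated with Python's standard `ast` module. This is a sketch of the rule rather than the engine's implementation: `keyword_matches` is a hypothetical helper that takes the call as source text and the expected keyword map as a plain dict.

```python
import ast


def keyword_matches(call_src: str, expected: dict) -> bool:
    """True if every expected keyword is passed as an equal constant literal."""
    node = ast.parse(call_src, mode="eval").body
    if not isinstance(node, ast.Call):
        return False
    # Keep only keywords whose value is a constant literal; variables and
    # **kwargs splats (kw.arg is None) are ignored.
    literals = {
        kw.arg: kw.value.value
        for kw in node.keywords
        if kw.arg is not None and isinstance(kw.value, ast.Constant)
    }
    for name in sorted(expected):  # sorted order for determinism
        if name not in literals:
            return False
        actual, want = literals[name], expected[name]
        # Exact typing: True must not match 1, and "true" must not match True.
        if type(actual) is not type(want) or actual != want:
            return False
    return True
```

With this sketch, `subprocess.run(cmd, shell=True)` satisfies `{"shell": True}`, whereas `shell=flag` does not, which is the documented false negative for non-literal values.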
A propagator describes how taint moves through an intermediate call, using a
flow from one position to another. A propagator is a kind: call pattern (only
call is allowed) plus a required flow mapping with exactly the keys
from and to.
```yaml
propagators:
  - { kind: call, pattern: "str.format", flow: { from: any-arg, to: return } }
  - { kind: call, pattern: "os.path.join", flow: { from: any-arg, to: return } }
```

`from` and `to` each take exactly one token from the grammar below; anything else
(e.g. `returns`, `arg:x`, an empty string) is rejected.

| Token | Meaning |
|---|---|
| `any-arg` | any positional argument |
| `arg:N` | the Nth positional argument (0-based, N a non-negative integer) |
| `self` | the receiver of a method call |
| `return` | the call's return value |

The YAML key `from` maps to the `Flow.from_` field, because `from` is a reserved
keyword in Python. The engine ships with sensible default propagation (e.g. string
concatenation and f-strings carry taint); propagators add library-specific flows.
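
The token grammar is small enough to check with one regular expression. The sketch below is illustrative only: `is_flow_token` and `is_valid_flow` are hypothetical helpers, and accepting leading zeros in `arg:N` is an assumption, since the grammar above does not say either way.

```python
import re

# any-arg | arg:N | self | return, matched against the whole token.
_FLOW_TOKEN = re.compile(r"any-arg|arg:[0-9]+|self|return")


def is_flow_token(token) -> bool:
    """True if token is exactly one token of the flow grammar."""
    return isinstance(token, str) and _FLOW_TOKEN.fullmatch(token) is not None


def is_valid_flow(flow) -> bool:
    """A flow mapping must have exactly the keys 'from' and 'to'."""
    return (
        isinstance(flow, dict)
        and set(flow) == {"from", "to"}
        and all(is_flow_token(flow[key]) for key in ("from", "to"))
    )
```

Using `fullmatch` rather than `match` is what rejects near-misses such as `returns`.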
A sanitizer removes taint. Sanitizers are trusted in the safe direction only: if scanipy is missing a sanitizer it will at worst report a false positive (noise) — it must never silently suppress a real vulnerability. When in doubt, leave a sanitizer out. This one-sidedness is principle P5.
Note that some "fixes" are not sanitizers of a string at all. For SQL injection, the fix is a bound-parameter call (a different, safe sink), not a function that cleans the string — so the SQL detector ships with no string sanitizers.
A spec is not done until it has both:
- a true-positive fixture (vulnerable code it must flag), and
- a true-negative fixture (safe/sanitized code it must not flag),
under tests/fixtures/python/{vulnerable,safe}/. See
writing-detectors.md and the /new-detector helper.
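
As an illustration of what such a pair might look like for an OS command injection spec with an `input` source, an `os.system` sink, and a `shlex.quote` sanitizer, consider the two functions below. The function names and their placement in a single block are hypothetical; real fixtures live in separate files under the directories named above.

```python
import os
import shlex


def vulnerable_handler() -> None:
    """True positive: tainted input reaches os.system unsanitized."""
    name = input("file: ")
    # String concatenation carries taint into the first positional argument.
    os.system("ls " + name)


def safe_handler() -> None:
    """True negative: the value is quoted before it reaches the sink."""
    name = input("file: ")
    os.system("ls " + shlex.quote(name))
```

The pair differs only in the sanitizer call, which keeps the fixture focused on the one behaviour the spec is meant to distinguish.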
The bundled v1 catalog is written entirely in the DSL surface above. A few vulnerability shapes are not expressible with the v1 taint engine and DSL; they are documented here rather than shipped as dead or misleading specs (P7).

- Safe-loader-as-keyword is a false positive risk. A detector cannot tell `yaml.load(data)` (unsafe) from `yaml.load(data, Loader=SafeLoader)` (safe): both match the `yaml.load` sink, and the `when: {keyword: …}` constraint only asserts a literal equals a value; it cannot assert that a safe loader is present. The `unsafe-deserialization` detector therefore flags both, so the recommended safe form in the bundled true-negative fixture is the distinct safe sink `yaml.safe_load(data)` (never matched), not `yaml.load(..., Loader=…)`. Treating `yaml.load(data, Loader=SafeLoader)` as a finding is an accepted FP (noise, never a missed vuln; P5).
- "Presence" sinks (taint-independent flags) are out of scope. The engine only emits a finding when tainted data reaches a checked argument. A vulnerability whose danger is the mere presence of an insecure option, independent of any tainted value, cannot be expressed. The canonical example is TLS verification disabled (`requests.get(url, verify=False)`, CWE-295): the risk is `verify=False` itself, even for a constant `url` that carries no taint. A `when: {keyword: {verify: false}}` sink would only fire when taint also reaches a positional URL argument, which collapses into SSRF and misses the textbook constant-URL case. `tls-verify-disabled` is therefore deferred until the engine grows a non-taint presence-sink primitive; it is not shipped as a dead spec.
- Application-specific sanitizers can't be pattern-matched. Some safe forms are validation logic, not a callable to name. SSRF is fixed by an allow-list host check, which is application code, not a library function, so the SSRF detector ships no sanitizer, and its true-negative fixture demonstrates the untainted-input case (a constant URL) rather than a sanitizer. A real but unrecognized allow-list check yields a false positive (noise; P5), never a missed vuln.
- Method-sink receiver shapes rely on leading-greedy wildcards. The SQL detector uses `*.execute` / `*.cursor.execute` so it fires on idiomatic receivers (`conn.cursor().execute(...)`, aliased cursors) without enumerating every shape. XML sinks deliberately stay module-qualified (`lxml.etree.*`, `xml.etree.ElementTree.*`) instead of a `*.parse` greedy form, so a safe `defusedxml.ElementTree.parse(...)` is never matched.

```yaml
id: python.injection.os-command
name: OS command injection
cwe: CWE-78
severity: high
languages: [python]
message: >
  Untrusted input reaches an OS command without sanitization. Prefer a list
  argv with shell=False, or quote inputs with shlex.quote.
sources:
  - { kind: call, pattern: "input" }
  - { kind: attribute, pattern: "flask.request.*" }
sanitizers:
  - { kind: call, pattern: "shlex.quote" }
sinks:
  - { kind: call, pattern: "os.system", args: [0] }
  - { kind: call, pattern: "subprocess.*", when: { keyword: { shell: true } } }
propagators:
  - { kind: call, pattern: "str.format", flow: { from: any-arg, to: return } }
```