scanipy taint-DSL reference

Status: v0 — LOCKED for 0.2.0. This is the v0 schema and it is frozen for the 0.2.0 release: the fields, pattern kinds, constraints, and flow grammar described here are the contract the 0.2.0 engine implements and will not change within the release. (Pre-1.0.0 the schema may still evolve in a future minor — see CHANGELOG.md — but a detector that validates against this reference works with 0.2.0 as written.) This file is the single source of truth for the spec format; other docs link here rather than restating it.

The parser is implemented and shipping (scanipy.dsl.parse_spec): it validates every field, all four pattern kinds (call, attribute, parameter, import), and the flow grammar, raising a location-aware DSLError on anything outside the DSL. See Validation & errors.

A detector is a declarative YAML file that tells scanipy how to find one class of vulnerability by taint tracking: follow untrusted data from a source, through optional propagators, to a dangerous sink — unless a sanitizer neutralizes it on the way. Detection logic lives entirely in these specs; the engine is class-agnostic (principle P4).

Bundled specs live in src/scanipy/detectors/<class>/<name>.yml and ship as package data.


File layout

id: python.injection.os-command      # unique id: <language>.<class>.<name>
name: OS command injection           # short human title
cwe: CWE-78                          # primary CWE
severity: high                       # low | medium | high | critical
languages: [python]                  # languages this spec applies to
message: >                           # shown on every finding; say what + how to fix
  Untrusted input reaches an OS command without sanitization...
metadata:                            # optional, free-form
  owasp: "A03:2021-Injection"
  references:
    - https://cwe.mitre.org/data/definitions/78.html

sources:    [ <pattern>, ... ]       # where taint enters        (required, >= 1)
sinks:      [ <pattern>, ... ]       # where taint is dangerous  (required, >= 1)
sanitizers: [ <pattern>, ... ]       # what neutralizes taint    (optional)
propagators: [ <propagator>, ... ]   # how taint flows through   (optional)

Top-level fields

| Field | Required | Notes |
|---|---|---|
| id | yes | Globally unique. Convention: <language>.<class>.<name>. |
| name | yes | Short human-readable title. |
| cwe | yes | Primary CWE identifier; must match CWE-<digits> (e.g. CWE-78). |
| severity | yes | One of low, medium, high, critical (lowercase strings). |
| languages | yes | Non-empty list; python is the only supported value in v1 (P7). |
| message | yes | Explains the flaw and the fix; rendered on every finding. |
| metadata | no | Free-form map (owasp, references, …); document order is preserved. |
| sources | yes | One or more patterns. |
| sinks | yes | One or more patterns. |
| sanitizers | no | Optional; may be []. A missing sanitizer never raises (P5). |
| propagators | no | Optional. Defaults to the engine's built-in propagation. |

Any unknown top-level key is rejected. Required keys must be present; the parser reports the first offending key in document order (deterministic, P3).
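The rules in the table can be sketched as a small standalone check. This is an illustration only, not scanipy's parser: the function name `first_top_level_problem` and the message wording are hypothetical, and the order in which missing required keys are reported is an assumption (the reference only pins "document order" for keys that are present).

```python
import re
from typing import Optional

REQUIRED = ("id", "name", "cwe", "severity", "languages", "message", "sources", "sinks")
OPTIONAL = ("metadata", "sanitizers", "propagators")
SEVERITIES = ("low", "medium", "high", "critical")
CWE_RE = re.compile(r"CWE-[0-9]+")


def first_top_level_problem(doc: dict) -> Optional[str]:
    """Return a message for the first problem found, or None if the shape is fine."""
    allowed = set(REQUIRED) | set(OPTIONAL)
    # dicts preserve insertion order, so this walks keys in document order.
    for key in doc:
        if key not in allowed:
            return f"{key}: unknown top-level key"
    # Assumption: missing keys are reported in the order of the field table.
    for key in REQUIRED:
        if key not in doc:
            return f"{key}: required key is missing"
    if not isinstance(doc["cwe"], str) or not CWE_RE.fullmatch(doc["cwe"]):
        return "cwe: must match CWE-<digits>"
    if doc["severity"] not in SEVERITIES:
        return "severity: must be one of low, medium, high, critical"
    if not isinstance(doc["languages"], list) or not doc["languages"]:
        return "languages: must be a non-empty list"
    for field in ("sources", "sinks"):
        if not isinstance(doc[field], list) or not doc[field]:
            return f"{field}: must contain at least one pattern"
    return None


spec = {
    "id": "python.injection.os-command",
    "name": "OS command injection",
    "cwe": "CWE-78",
    "severity": "high",
    "languages": ["python"],
    "message": "Untrusted input reaches an OS command without sanitization.",
    "sources": [{"kind": "call", "pattern": "input"}],
    "sinks": [{"kind": "call", "pattern": "os.system", "args": [0]}],
}
print(first_top_level_problem(spec))
```

The sketch stops at the first problem, mirroring the deterministic single-error behaviour described above; it does not validate the patterns inside sources and sinks.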


Validation & errors

scanipy.dsl.parse_spec validates every field shape, enum, and pattern/flow grammar, and rejects anything outside the DSL (unknown keys/kinds, bad enums, malformed patterns or flows, empty sources/sinks). Validation is exhaustive and declarative (P4): there is no per-detector or per-CWE logic in the parser.

On any problem it raises a scanipy.dsl.DSLError (a ValueError subclass) whose str() is a single, deterministic line:

path:line:col: [spec_id] field: message

for example:

detectors/injection/os-command.yml:29:5: [python.injection.os-command] sinks[1].when: unknown 'when' condition 'argument'; v1 supports: keyword

The error also exposes the raw pieces as attributes (.spec_id, .field, .source_path, .line, .column) for programmatic use. Lines are 1-based and columns 0-based. Invalid YAML and structural problems (empty document, non-mapping root, duplicate keys) are reported the same way — a raw yaml exception never escapes.
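To make the single-line format concrete, here is a minimal stand-in for the error type. `SketchDSLError` is a hypothetical class written for this illustration; only the attribute names and the string layout are taken from the description above, and the real scanipy.dsl.DSLError constructor may look different.

```python
class SketchDSLError(ValueError):
    """Illustrative stand-in that formats path:line:col: [spec_id] field: message."""

    def __init__(self, source_path, line, column, spec_id, field, message):
        self.source_path = source_path
        self.line = line        # 1-based
        self.column = column    # 0-based
        self.spec_id = spec_id
        self.field = field
        super().__init__(
            f"{source_path}:{line}:{column}: [{spec_id}] {field}: {message}"
        )


try:
    raise SketchDSLError(
        "detectors/injection/os-command.yml", 29, 5,
        "python.injection.os-command", "sinks[1].when",
        "unknown 'when' condition 'argument'; v1 supports: keyword",
    )
except ValueError as exc:  # a ValueError subclass, so generic handlers catch it too
    rendered = str(exc)
    location = (exc.source_path, exc.line, exc.column)

print(rendered)
```

Because the message is one deterministic line and the pieces are also available as attributes, a caller can either print it as-is or build its own report from the fields.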


Patterns

A pattern matches a syntactic site. It has a kind, a dotted pattern string (with * wildcards), and optional constraints.

{ kind: call, pattern: "os.system", args: [0] }
{ kind: attribute, pattern: "flask.request.*" }
{ kind: call, pattern: "subprocess.*", when: { keyword: { shell: true } } }

kind

| kind | Matches | Pattern shape | Status |
|---|---|---|---|
| call | a function/method call, e.g. os.system(...) | dotted path, * wildcards | supported |
| attribute | an attribute access, e.g. flask.request.args | dotted path, * wildcards | supported |
| parameter | a function parameter (request-handler args) | a bare name (request) or a scoped selector (handler.request) | supported |
| import | an imported module/name | a module path, optionally ending in * (pickle, flask.*) | supported |

All four kinds share the same dotted-path grammar (see pattern). The parser validates pattern shape for every kind; the runtime meaning of parameter/import is resolved by the engine. args and when are accepted only on kind: call — see the validity matrix below.

The matcher resolves each kind against a different canonical name: a call matches on the callee path, an attribute on the attribute chain, an import on the imported canonical name, and a parameter on the bare parameter name (matched with the same wildcard grammar). The shape of all four kinds is part of the locked v0 schema and is fully validated by the parser. No bundled 0.2.0 detector uses parameter or import, however: the seven shipped detectors are written with call/attribute patterns, so the richer runtime semantics of those two kinds are not relied upon by the shipped catalog in this release (honest scope, P7). Treat them as structural-for-now.

pattern

A dotted path with * as a wildcard segment:

  • os.system — exactly os.system
  • subprocess.* — any direct attribute of subprocess (run, Popen, …)
  • *.cursor.execute — execute on any object's .cursor

Wildcard matching semantics (pinned)

The pattern is matched against the frontend's canonical dotted name for the site (imports/aliases already resolved). Matching is segment-wise — both the pattern and the name are split on . and compared segment by segment, never as a substring or a glob over the raw string. A single * segment is allowed, and its position picks one of exactly three modes:

| Mode | Where * is | * consumes | Example pattern | Matches | Does not match |
|---|---|---|---|---|---|
| Exact | no * | n/a | os.system | os.system | os.popen; mymod.os.system |
| Exact (bare) | no * | n/a | input | input | mymod.input |
| Trailing-single | last segment | exactly one segment | subprocess.* | subprocess.run | subprocess.run.foo; bare subprocess |
| Trailing-single | last segment | exactly one segment | flask.request.* | flask.request.args | flask.request.args.get |
| Leading-greedy | first segment | one or more segments | *.execute | db.execute; self.db.cursor.execute | db.executemany; bare execute |
| Leading-greedy | first segment | one or more segments | *.cursor.execute | self.db.cursor.execute | self.db.execute |

  • Trailing-single (pkg.*) fixes the literal prefix and requires the name to be exactly one segment longer — * stands for one attribute, not a subtree.
  • Leading-greedy (*.tail) fixes the literal suffix and lets * swallow one or more leading segments, with the receiver prefix left unconstrained. This is the load-bearing safety net for idiomatic method sinks: *.execute fires on self.db.cursor.execute (an aliased/deep receiver) without the spec author having to enumerate every receiver shape. It is intentionally greedy, unlike trailing-single, and is matched per segment, so *.execute never matches executemany (different last segment).
  • A lone * is the trailing-single case with an empty prefix: it matches any single-segment name and nothing dotted.
  • Any other * placement is not a valid pattern, and the parser rejects it at load time with a DSLError, so a typo'd pattern fails loudly rather than becoming a silently-dead rule (P5/P7). * is allowed only once, as a single whole leading segment (*.suffix) or a single whole trailing segment (prefix.*). Rejected examples: a partial-segment * like os.sys*, a middle-position * like a.*.c or os.*.system, and more than one * like *.* or *.a.*. The matcher additionally never widens: were such a Pattern ever constructed directly (bypassing the parser), it is treated as no match (defense-in-depth: the matcher never guesses).
  • An unresolved name (the frontend could not canonicalize the callee/target, e.g. foo()()) is always a no-match — never an error.
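The three modes can be sketched as a segment-wise matcher. This is a standalone illustration of the rules above, not the engine's code; the function names `is_valid_pattern` and `matches` are hypothetical, and an unresolved name is modelled here as None.

```python
from typing import Optional


def is_valid_pattern(pattern: str) -> bool:
    """At most one '*', and only as a whole first or whole last segment."""
    segments = pattern.split(".")
    if any(seg == "" for seg in segments):
        return False
    starred = [i for i, seg in enumerate(segments) if "*" in seg]
    if not starred:
        return True
    if len(starred) > 1 or segments[starred[0]] != "*":
        return False  # more than one '*', or a partial segment such as sys*
    return starred[0] in (0, len(segments) - 1)


def matches(pattern: str, name: Optional[str]) -> bool:
    """Compare pattern and canonical name segment by segment."""
    if not name or not is_valid_pattern(pattern):
        return False  # unresolved names and invalid patterns never match
    pat, segs = pattern.split("."), name.split(".")
    if "*" not in pat:
        return pat == segs  # exact
    if pat[-1] == "*":
        # Trailing-single (a lone '*' is this case with an empty prefix).
        return len(segs) == len(pat) and segs[:-1] == pat[:-1]
    # Leading-greedy: literal suffix, '*' swallows one or more leading segments.
    tail = pat[1:]
    return len(segs) > len(tail) and segs[-len(tail):] == tail


print(matches("*.execute", "self.db.cursor.execute"))
print(matches("*.execute", "db.executemany"))
```

Splitting on "." before comparing is what keeps *.execute from matching executemany: the last segments are compared as whole strings, never as prefixes.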

Optional constraints

| Key | Valid on | Meaning |
|---|---|---|
| args | call only | Restrict to specific positional argument indices, e.g. args: [0]. A non-empty list of non-negative integers; the parser sorts and de-duplicates them. Taint in any listed argument triggers the rule. |
| when | call only | Extra conditions. v0 supports exactly when: { keyword: { name: value } } — e.g. require shell=True. The name must be a valid identifier and the value must be a scalar. |

args / when validity matrix

args and when are accepted only on kind: call; the parser rejects them on attribute, parameter, and import with a precise error.

| kind | args | when |
|---|---|---|
| call | allowed | allowed |
| attribute | rejected | rejected |
| parameter | rejected | rejected |
| import | rejected | rejected |

Scalar typing is exact: shell: true is a YAML boolean and stays a bool, while shell: 'true' stays the string "true". The two are kept distinct so the engine's shell=True check is unambiguous.

args matching semantics (pinned)

  • Indices are 0-based and positional, and they exclude the receiver of a method call. On *.execute the receiver is the object before .execute; args: [0] targets the first written argument (the SQL string), not the receiver. The receiver is addressed as self in the flow vocabulary.
  • With no args key, every written positional argument is in scope.
  • With an args restriction, the engine checks the intersection of the listed indices with the call's actual written positional arguments. Out-of-range and negative indices are dropped; the surviving set is sorted and de-duplicated.
  • If a restriction names only out-of-range indices (e.g. args: [5] on a two-argument call), the pattern does not match that site — it cannot carry the targeted taint.
  • Known gap (v1): args is positional-only. A dangerous value passed by keyword (subprocess.run(args=cmd, shell=True)) is not covered by a positional args restriction — this is the deferred "kwarg-targeted args" DSL extension. The net effect is a false negative (honest scope, P7); model such sinks with when (as the os-command spec does for subprocess.*) where possible. *args splats are counted as one written positional; over-approximating them is the engine's decision, not the matcher's.
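The index rules above reduce to a small intersection. The helper below is a sketch with a hypothetical name (`args_in_scope`); it returns the positions the engine would check, or None when the restriction leaves nothing in range and the pattern therefore does not match the site. How a call with no positional arguments and no args key is treated is not stated in this reference, so the empty list returned for that case is an assumption.

```python
from typing import List, Optional, Sequence


def args_in_scope(
    restriction: Optional[Sequence[int]], written_positionals: int
) -> Optional[List[int]]:
    """Intersect an args restriction with the call's written positional arguments."""
    if restriction is None:
        # No args key: every written positional argument is in scope.
        return list(range(written_positionals))
    # Drop out-of-range and negative indices, then sort and de-duplicate.
    surviving = sorted({i for i in restriction if 0 <= i < written_positionals})
    # Only out-of-range indices were named: the pattern does not match this site.
    return surviving or None


print(args_in_scope([0], 2))   # cursor.execute(sql, params): checks the SQL string
print(args_in_scope([5], 2))   # nothing in range, so no match
```

The receiver is never counted: for cursor.execute(sql, params) the written positionals are sql (index 0) and params (index 1).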

when matching semantics (pinned)

  • v1 supports exactly the keyword condition. when: { keyword: { shell: true } } matches only a keyword argument passed as a constant literal equal to the expected value. shell=True matches only a literal True; shell=False, an absent shell, or shell=<variable> (a non-literal) do not match.
  • Multiple keyword pairs are ANDed — every pair must hold. Keyword names and when keys are evaluated in sorted order for determinism (P3).
  • Any unknown top-level when key (anything other than keyword) is treated as no match, so an unsupported constraint can never silently widen matches. The parser is the real gate and rejects such keys up front.
  • Known gap (v1): literal-equality only — shell=<truthy variable> is a false negative. Niche, documented (P7).
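A literal-equality keyword check can be sketched with the standard ast module. This is not the engine's frontend; `keyword_condition_holds` is a hypothetical helper that parses a single call expression from source text. The exact-type comparison (so that shell=1 does not satisfy shell: true) is an assumption drawn from the "scalar typing is exact" rule above.

```python
import ast


def keyword_condition_holds(call_source: str, expected: dict) -> bool:
    """True only if every expected keyword is passed as an equal constant literal."""
    node = ast.parse(call_source, mode="eval").body
    if not isinstance(node, ast.Call):
        return False
    # **kwargs splats have kw.arg == None and are ignored here.
    passed = {kw.arg: kw.value for kw in node.keywords if kw.arg is not None}
    for name in sorted(expected):  # sorted for a deterministic evaluation order
        value = passed.get(name)
        if not isinstance(value, ast.Constant):
            return False  # absent keyword, or a non-literal such as a variable
        if type(value.value) is not type(expected[name]):
            return False  # bool True is kept distinct from 'true' and from 1
        if value.value != expected[name]:
            return False
    return True  # all pairs held (AND)


print(keyword_condition_holds("subprocess.run(cmd, shell=True)", {"shell": True}))
print(keyword_condition_holds("subprocess.run(cmd, shell=flag)", {"shell": True}))
```

The second call shows the documented gap: a truthy variable is not a literal, so the condition does not hold and the site is not matched.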

Propagators

A propagator describes how taint moves through an intermediate call, using a flow from one position to another. A propagator is a kind: call pattern (only call is allowed) plus a required flow mapping with exactly the keys from and to.

propagators:
  - { kind: call, pattern: "str.format", flow: { from: any-arg, to: return } }
  - { kind: call, pattern: "os.path.join", flow: { from: any-arg, to: return } }

Flow vocabulary

from and to each take exactly one token from the grammar below; anything else (e.g. returns, arg:x, an empty string) is rejected.

| Token | Meaning |
|---|---|
| any-arg | any positional argument |
| arg:N | the Nth positional argument (0-based, N a non-negative integer) |
| self | the receiver of a method call |
| return | the call's return value |

The YAML key from maps to the Flow.from_ field, because from is a reserved keyword in Python and cannot be used as an attribute name. The engine ships with sensible default propagation (e.g. string concatenation and f-strings carry taint); propagators add library-specific flows.
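The token grammar is small enough to check with one regular expression. The sketch below is illustrative: `SketchFlow` and `parse_flow` are hypothetical names (only the from_ field name comes from this reference), and accepting leading zeros in arg:N is an assumption.

```python
import re
from dataclasses import dataclass

# any-arg | arg:N | self | return, matched against the whole token.
FLOW_TOKEN = re.compile(r"any-arg|arg:[0-9]+|self|return")


@dataclass(frozen=True)
class SketchFlow:
    from_: str  # the YAML key is 'from'; the trailing underscore avoids the keyword
    to: str


def parse_flow(flow: dict) -> SketchFlow:
    """Validate a flow mapping that must have exactly the keys 'from' and 'to'."""
    if not isinstance(flow, dict) or set(flow) != {"from", "to"}:
        raise ValueError("flow: must be a mapping with exactly the keys from and to")
    for key in ("from", "to"):
        token = flow[key]
        if not isinstance(token, str) or not FLOW_TOKEN.fullmatch(token):
            raise ValueError(f"flow.{key}: invalid flow token {token!r}")
    return SketchFlow(from_=flow["from"], to=flow["to"])


print(parse_flow({"from": "any-arg", "to": "return"}))
```

Using fullmatch rather than match is what rejects near-misses such as returns, which would otherwise pass as a prefix match on return.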


Sanitizers and soundness (P5)

A sanitizer removes taint. Sanitizers are trusted in the safe direction only: if scanipy is missing a sanitizer it will at worst report a false positive (noise) — it must never silently suppress a real vulnerability. When in doubt, leave a sanitizer out. This one-sidedness is principle P5.

Note that some "fixes" are not sanitizers of a string at all. For SQL injection, the fix is a bound-parameter call (a different, safe sink), not a function that cleans the string — so the SQL detector ships with no string sanitizers.


Every detector ships a TP and a TN fixture (P5)

A spec is not done until it has both:

  • a true-positive fixture (vulnerable code it must flag), and
  • a true-negative fixture (safe/sanitized code it must not flag),

under tests/fixtures/python/{vulnerable,safe}/. See writing-detectors.md and the /new-detector helper.
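As a rough picture of what such a pair might contain for the os-command spec, consider the two functions below. They are a sketch only: the function names are hypothetical, real fixtures live in separate files under the vulnerable and safe directories, and the exact layout expected by the test suite is described in writing-detectors.md rather than here.

```python
import os
import shlex


def vulnerable_fixture() -> None:
    # True positive: input() is a source, string concatenation carries the
    # taint, and os.system argument 0 is a sink.
    filename = input("file: ")
    os.system("cat " + filename)


def safe_fixture() -> None:
    # True negative: shlex.quote is listed as a sanitizer, so the value that
    # reaches os.system is no longer considered tainted.
    filename = input("file: ")
    os.system("cat " + shlex.quote(filename))
```

The two functions differ only in the sanitizer call, which keeps the pair focused on the one behaviour the spec is meant to distinguish.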


v1 known limitations (catalog scope, P7)

The bundled v1 catalog is written entirely in the DSL surface above. A few vulnerability shapes are not expressible with the v1 taint engine and DSL; they are documented here rather than shipped as dead or misleading specs (P7).

  • Safe-loader-as-keyword is a false positive risk. A detector cannot tell yaml.load(data) (unsafe) from yaml.load(data, Loader=SafeLoader) (safe): both match the yaml.load sink, and the when: {keyword: …} constraint only asserts a literal equals a value — it cannot assert that a safe loader is present. The unsafe-deserialization detector therefore flags both, so the recommended safe form in the bundled true-negative fixture is the distinct safe sink yaml.safe_load(data) (never matched), not yaml.load(..., Loader=…). Treating yaml.load(data, Loader=SafeLoader) as a finding is an accepted FP (noise, never a missed vuln — P5).

  • "Presence" sinks (taint-independent flags) are out of scope. The engine only emits a finding when tainted data reaches a checked argument. A vulnerability whose danger is the mere presence of an insecure option — independent of any tainted value — cannot be expressed. The canonical example is TLS verification disabled (requests.get(url, verify=False), CWE-295): the risk is verify=False itself, even for a constant url that carries no taint. A when: {keyword: {verify: false}} sink would only fire when taint also reaches a positional URL argument, which collapses into SSRF and misses the textbook constant-URL case. tls-verify-disabled is therefore deferred until the engine grows a non-taint presence-sink primitive — it is not shipped as a dead spec.

  • Application-specific sanitizers can't be pattern-matched. Some safe forms are validation logic, not a callable to name. SSRF is fixed by an allow-list host check, which is application code, not a library function — so the SSRF detector ships no sanitizer, and its true-negative fixture demonstrates the untainted-input case (a constant URL) rather than a sanitizer. A real but unrecognized allow-list check yields a false positive (noise — P5), never a missed vuln.

  • Method-sink receiver shapes rely on leading-greedy wildcards. The SQL detector uses *.execute / *.cursor.execute so it fires on idiomatic receivers (conn.cursor().execute(...), aliased cursors) without enumerating every shape. XML sinks deliberately stay module-qualified (lxml.etree.*, xml.etree.ElementTree.*) instead of a *.parse greedy form, so a safe defusedxml.ElementTree.parse(...) is never matched.


Worked example

id: python.injection.os-command
name: OS command injection
cwe: CWE-78
severity: high
languages: [python]
message: >
  Untrusted input reaches an OS command without sanitization. Prefer a list
  argv with shell=False, or quote inputs with shlex.quote.
sources:
  - { kind: call, pattern: "input" }
  - { kind: attribute, pattern: "flask.request.*" }
sanitizers:
  - { kind: call, pattern: "shlex.quote" }
sinks:
  - { kind: call, pattern: "os.system", args: [0] }
  - { kind: call, pattern: "subprocess.*", when: { keyword: { shell: true } } }
propagators:
  - { kind: call, pattern: "str.format", flow: { from: any-arg, to: return } }