Skip to content

Fix aggregate_events leaking non-matching sub-events into buckets#2013

Open
jichaowang02-lang wants to merge 1 commit into
smicallef:masterfrom
jichaowang02-lang:fix/aggregate-events-list-mutation
Open

Fix aggregate_events leaking non-matching sub-events into buckets#2013
jichaowang02-lang wants to merge 1 commit into
smicallef:masterfrom
jichaowang02-lang:fix/aggregate-events-list-mutation

Conversation

@jichaowang02-lang

Copy link
Copy Markdown

Summary

SpiderFootCorrelator.aggregate_events corrupts its result buckets when a
correlation rule's field is a sub-field (e.g. child.data, source.data,
entity.data): each bucket retains sub-events that do not match the bucket
value, which then skews every downstream analysis that re-extracts sub-fields
(threshold counts, outlier counts, match_all_to_first_collection).

Root cause

event_strip() mutates the list it is iterating:

for s in event[topfield]:
    if s[subfield] != value:
        event[topfield].remove(s)

Removing the current element shifts the remaining elements down one index while
the iterator still advances, so the element immediately after each removed
one is never inspected and survives.

For field = "child.data" and one event with children a, b, c, d:

bucket expected actual (before fix)
a ["a"] ["a", "c"]
b ["b"] ["b", "d"]
c ["c"] ["b", "c"]
d ["d"] ["b", "d"]

(Reproduced against the real aggregate_events.)

Fix

Iterate over a snapshot copy so removal can't skip elements:

for s in event[topfield][:]:
    ...

Testing

$ pytest test/unit/spiderfoot/test_spiderfootcorrelator.py -q
17 passed, 55 subtests passed

Adds test_aggregate_events_does_not_leak_non_matching_subevents, asserting
each bucket keeps only its matching children. flake8-clean on the changed lines.

`event_strip()` iterated over `event[topfield]` while calling `.remove()` on
the same list. Removing the current element advances the iterator past the
next one, so a non-matching sub-event immediately following a removed one is
never inspected and wrongly survives in the bucket. When a correlation rule's
field is a sub-field (e.g. `child.data`), every bucket then retains foreign
children (bucket `a` kept `["a", "c"]`), corrupting the threshold / outlier /
match analyses that re-extract sub-fields from the bucketed events.

Iterate over a snapshot copy (`event[topfield][:]`) so removal can't skip
elements. Adds a regression test asserting each bucket keeps only its matching
children.
Copilot AI review requested due to automatic review settings June 21, 2026 15:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants