
Conversation


@Mikaayenson Mikaayenson commented Sep 3, 2025

Pull Request

Issue link(s):

Summary - What I changed

In #4688, @Samirbous is adding the first multi-dataset query to the repo. His PR leverages EQL to correlate across different data sources per subquery. This PR refactors the integration validation to support multiple data sources used within an EQL sequence query (multiple packages within a single integration, or multiple integrations).

Important

Instead of validating an entire EQL sequence query with a single merged schema, we're now validating subqueries individually with the proper schemas.

To clean up some of the # type: ignore[reportUnknownVariableType] litter, it might be good to apply @typing.no_type_check

As part of this large refactor, the major change is this: previously we had several branching conditions and multiple validation calls per query just to double-check validation. Now, for each rule, we build a validation plan by pulling all of the schemas it needs, then execute that plan.
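A minimal sketch of the plan-then-execute shape, with hypothetical names (ValidationStep, schema_for_dataset, extract_subqueries, and parse_with_schema are illustrative stand-ins, not the PR's actual code):

    # Sketch only: collect (subquery, schema) pairs up front, then run them in
    # one pass instead of branching and re-validating per query type.
    from dataclasses import dataclass

    @dataclass
    class ValidationStep:
        subquery: str           # one sequence subquery (or the whole query)
        schema: dict[str, str]  # field -> type map scoped to the subquery's dataset

    def build_plan(rule) -> list[ValidationStep]:
        # schema_for_dataset and extract_subqueries are hypothetical helpers
        return [
            ValidationStep(sub.text, schema_for_dataset(sub.dataset))
            for sub in extract_subqueries(rule.query)
        ]

    def execute_plan(plan: list[ValidationStep]) -> None:
        for step in plan:
            parse_with_schema(step.subquery, step.schema)  # raises on unknown fields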

Warning

Bumping this to a minor version, as it may break validation for users (now that we're identifying new potential errors).

How To Test

  • Added a new unit test class to validate the change, so CI should pass. I also refactored tests/test_python_library.py for more consistent test formatting
  • Ensure we're not inadvertently breaking sequence validation
  • Ensure we didn't introduce regressions in DaC auto schema generation
  • In testing I identified several rules now failing validation (often when beats schemas were added to rules but the fields were not present in those beats schemas).
Failing Rules

Each has to be manually checked: double_check_siem_rules.txt

Note

The unit tests will fail until these rules are tuned.

#5072

Additional Context

EQL’s parser accepts a single flat schema per parse. It has no concept of “schema scoped by dataset per subquery.” If you pass the whole sequence with a merged schema, you lose the ability to enforce that each subquery uses only the fields from its own integration/package.
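For illustration, here is a minimal sketch of that constraint using the open-source eql package directly (the repo wraps this API in its own helpers, so the exact Schema arguments are an assumption):

    import eql

    # One flat event-type -> field-type map per parse; there is no way to say
    # "these fields only apply when event.dataset == X".
    schema = {"process": {"name": "string", "pid": "number"}}

    with eql.Schema(schema):
        eql.parse_query('process where name == "cmd.exe"')  # ok: field is in the schema

    with eql.Schema(schema):
        eql.parse_query('process where badfield == "x"')    # raises: unknown field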

Why not validate once with a merged schema

  • Superset masking: a field from integration A will exist in the merged schema even when you’re in a subquery whose dataset belongs to integration B. The parse will succeed, and you won’t catch the misuse (see the toy example after this list).
  • Type conflicts: different packages can define the same field name with different types. A merged map picks one type arbitrarily (often last-wins), producing wrong acceptances or wrong errors.
  • Ambiguous errors: Even if you detect an error, you can’t attribute it cleanly to “subquery X vs package Y” because the validation had no subquery boundary.
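To make the masking concrete, a toy example with made-up field maps:

    # Hypothetical field maps for two integrations (illustrative only).
    integration_a = {"aws.cloudtrail.user_identity.arn": "keyword"}
    integration_b = {"okta.actor.id": "keyword"}
    merged = {**integration_a, **integration_b}

    # A subquery scoped to integration B that mistakenly uses an A-only field:
    bad_field = "aws.cloudtrail.user_identity.arn"
    print(bad_field in merged)         # True  -> merged schema accepts the misuse
    print(bad_field in integration_b)  # False -> per-dataset schema catches it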

Why per-subquery validation is necessary

  • Flat-schema constraint: EQL validates against one field-type map at a time. To emulate “dataset scoping,” we parse each subquery with only the fields from the dataset’s integration (plus ECS/index/custom schemas as needed); see the sketch after this list.
  • Correctness by construction: if a subquery references a field from another package, it won’t be present in that subquery’s schema, and the parser raises “Unknown field” (or “Field not recognized” with the proper trailer).
  • Clear attribution: You get an error bound to the specific subquery and its intended package, which is actionable.
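Putting it together, a rough sketch of the per-subquery pass (schema_for_dataset is a hypothetical helper, and the naive event-type split stands in for the real AST handling):

    import eql

    def validate_subquery(dataset: str, subquery_text: str) -> str | None:
        """Parse one subquery (e.g. 'network where okta.actor.id != null')
        against only its own dataset's field map."""
        fields = schema_for_dataset(dataset)   # hypothetical: dataset -> {field: type}
        event_type = subquery_text.split()[0]  # naive; real code walks the AST
        try:
            with eql.Schema({event_type: fields}):
                eql.parse_query(subquery_text)
        except eql.EqlError as exc:
            return f"[{dataset}] {exc}"  # error bound to this subquery and its package
        return None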

Checklist

  • Added a label for the type of PR: bug, enhancement, schema, maintenance, Rule: New, Rule: Deprecation, Rule: Tuning, Hunt: New, or Hunt: Tuning so guidelines can be generated
  • Added the meta:rapid-merge label if planning to merge within 24 hours
  • Secret and sensitive material has been managed correctly
  • Automated testing was updated or added to match the most common scenarios
  • Documentation and comments were added for features that require explanation

@Mikaayenson Mikaayenson self-assigned this Sep 3, 2025
@Mikaayenson Mikaayenson added the enhancement (New feature or request), test-suite (unit and other testing components), and python (Internal python for the repository) labels Sep 3, 2025

github-actions bot commented Sep 3, 2025

Enhancement - Guidelines

These guidelines serve as a reminder set of considerations when addressing adding a feature to the code.

Documentation and Context

  • Describe the feature enhancement in detail (alternative solutions, description of the solution, etc.) if not already documented in an issue.
  • Include additional context or screenshots.
  • Ensure the enhancement includes necessary updates to the documentation and versioning.

Code Standards and Practices

  • Code follows established design patterns within the repo and avoids duplication.
  • Ensure that the code is modular and reusable where applicable.

Testing

  • New unit tests have been added to cover the enhancement.
  • Existing unit tests have been updated to reflect the changes.
  • Provide evidence of testing and validating the enhancement (e.g., test logs, screenshots).
  • Validate that any rules affected by the enhancement are correctly updated.
  • Ensure that performance is not negatively impacted by the changes.
  • Verify that any release artifacts are properly generated and tested.
  • Conducted system testing, including fleet, import, and create APIs (e.g., run make test-cli, make test-remote-cli, make test-hunting-cli)

Additional Checks

  • Verify that the enhancement works across all relevant environments (e.g., different OS versions).
  • Confirm that the proper version label is applied to the PR: patch, minor, or major.

@Mikaayenson Mikaayenson marked this pull request as draft September 3, 2025 21:13
@Mikaayenson Mikaayenson marked this pull request as ready for review September 3, 2025 21:41

terrancedejesus commented Sep 4, 2025

@Mikaayenson have we tried a rule or query that uses truly separate data sources (separate integrations)? Like Okta and Azure Activity logs? The rule mentioned is the Azure integration, but with Entra ID Protection logs and Entra ID Audit logs as separate data streams. Similar to how we correlate Entra ID sign-ins to Microsoft Graph activity here, but it's an ES|QL rule. The closest I believe to truly separate data sources is this Okta rule, which looks at Okta system logs and any logs reported by a Windows endpoint, but it does not use event.dataset, and thus we did not run into this support issue.


Mikaayenson commented Sep 4, 2025

> @Mikaayenson have we tried a rule or query that uses truly separate data sources (separate integrations)? […]

@terrancedejesus Did you see the unit tests?

@terrancedejesus

> @terrancedejesus Did you see the unit tests?

rgr, thanks for sharing. From the testing, I see we cover the following:

  • 1 integration : 2+ data streams
  • 2 integrations : 2+ data streams

That covers my question. Thank you!

@Mikaayenson Mikaayenson marked this pull request as draft September 5, 2025 03:47
@Mikaayenson Mikaayenson changed the title [FR] Support Multi-Dataset Sequence Validation [FR] Refactor Schema Validation & Support Multi-Dataset Sequence Validation Sep 6, 2025
build_rule(query, "kuery")

@unittest.skip("Redundant with new validation?")

I suspect you are correct, will double check.

@Mikaayenson Mikaayenson marked this pull request as ready for review September 6, 2025 09:32
Comment on lines +361 to +363
# Wrap the subquery in a minimal two-branch sequence so the parser still
# validates the join ("by") fields against the schema.
join_fields = ", ".join(map(str, getattr(subquery, "join_values", []) or []))
dummy_by = f" by {join_fields}" if join_fields else ""
return f"sequence\n {subquery_text}\n [any where true]{dummy_by}"

@Mikaayenson Mikaayenson Sep 8, 2025


I initially thought about just appending runs=2 instead, but then we lose validation on the join-by fields.

        return f"sequence\n  {subquery_text} with runs=2"

Labels
backport: auto, enhancement (New feature or request), minor, python (Internal python for the repository), schema, test-suite (unit and other testing components)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Integration Validation Missing Dataset Specific Schemas
[Bug] EQL Sequence Multi-Data Source Schema Validation
5 participants