[FR] Refactor Schema Validation & Support Multi-Dataset Sequence Validation #5059
Conversation
Enhancement - Guidelines
These guidelines serve as a reminder of considerations when adding a feature to the code.
Documentation and Context
Code Standards and Practices
Testing
Additional Checks
@Mikaayenson have we tried a rule or query that uses truly separate data sources (separate integrations)? Like Okta and Azure activity logs? The rule mentioned uses the Azure integration, with Entra ID Protection logs and Entra ID Audit logs as separate data streams. Similar to how we correlate Entra ID sign-ins to Microsoft Graph activity here, but it's an ES|QL rule. The closest I believe to truly separate data sources is this Okta rule, which looks at Okta system logs and any logs reported by a Windows endpoint, but does not use …
@terrancedejesus Did you see the unit tests?
rgr, thanks for sharing. I see from the testing that we cover the following: 1 integration : 2+ data streams. That covers my question. Thank you!
build_rule(query, "kuery")
@unittest.skip("Redundant with new validation?")
I suspect you are correct, will double check.
# Mirror the subquery's join fields onto a dummy tail event so the two-item
# sequence still parses and the `by` join fields are validated.
join_fields = ", ".join(map(str, getattr(subquery, "join_values", []) or []))
dummy_by = f" by {join_fields}" if join_fields else ""
return f"sequence\n {subquery_text}\n [any where true]{dummy_by}"
Initially thought about just appending runs=2 instead, but then we lose validation on the join-by fields (the dummy [any where true] by … tail exists precisely to re-introduce subquery.join_values so they get parsed and validated).
return f"sequence\n {subquery_text} with runs=2"
Pull Request
Issue link(s):
Summary - What I changed
In #4688 @Samirbous is adding the first multi-dataset query to the repo. His PR leverages EQL to correlate across different data sources per subquery. This PR refactors the integration validation to support multiple data sources used within an EQL sequence query (multiple packages within a single integration, or multiple integrations).
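As an illustration of the query shape this enables (the dataset names below are examples, not taken from #4688):

# Illustrative multi-dataset EQL sequence: each subquery targets a different
# data stream, so each must be validated against its own integration schema.
MULTI_DATASET_SEQUENCE = """
sequence by user.name
  [authentication where event.dataset == "azure.signinlogs"]
  [any where event.dataset == "azure.auditlogs"]
"""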
Important
Instead of validating an entire EQL sequence query with a single merged schema, we're now validating each subquery individually with the proper schema.
To clean up some of the # type: ignore[reportUnknownVariableType] litter, it might be good to use @typing.no_type_check.
As part of this large refactor, the major change is that previously we had several branching conditions and multiple validation calls per query just to double-check validation. Now, for each rule, we build a validation plan by pulling all the schemas needed, then execute that plan.
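A minimal sketch of the plan-then-execute shape, with hypothetical names and a deliberately naive field extractor (the actual implementation uses the EQL parser and real integration schemas):

import re
from dataclasses import dataclass

@dataclass
class ValidationStep:
    """One subquery paired with the schema it must be validated against."""
    subquery_text: str
    schema: dict[str, str]  # field name -> type, scoped to one integration/package

def fields_in(subquery_text: str) -> set[str]:
    """Naive field extraction for this sketch: dotted identifiers outside string literals."""
    unquoted = re.sub(r'"[^"]*"', '', subquery_text)
    return set(re.findall(r"\b[a-z_]+(?:\.[a-z_]+)+\b", unquoted))

def execute_plan(steps: list[ValidationStep]) -> list[str]:
    """Run every step, collecting all errors instead of failing fast."""
    errors: list[str] = []
    for step in steps:
        for field in sorted(fields_in(step.subquery_text) - set(step.schema)):
            errors.append(f"unknown field {field!r} in: {step.subquery_text}")
    return errors

# The "plan": each subquery paired with only its own integration's schema.
plan = [
    ValidationStep('[authentication where event.dataset == "azure.signinlogs"]',
                   {"event.dataset": "keyword"}),
    ValidationStep('[any where azure.auditlogs.operation_name != null]',
                   {"event.dataset": "keyword", "azure.auditlogs.operation_name": "keyword"}),
]
print(execute_plan(plan))  # [] -> each subquery passes against its own schema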
Warning
Marking this as a minor version bump as it may break validation for users (now that we're identifying new potential errors).
How To Test
Failing Rules
Each has to be manually checked: double_check_siem_rules.txt
Note
Unit tests will fail until these rules are tuned.
#5072
Additional Context
EQL’s parser accepts a single flat schema per parse. It has no concept of “schema scoped by dataset per subquery.” If you pass the whole sequence with a merged schema, you lose the ability to enforce that each subquery uses only the fields from its own integration/package.
Why not validate once with a merged schema?
A merged (union) schema lets a field from one integration pass validation inside a subquery that targets a different integration's dataset, so cross-dataset field mistakes go undetected.
Why per-subquery validation is necessary
Validating each subquery against only the schema for its own dataset(s) enforces that the subquery uses only the fields its integration/package actually defines.
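A toy demonstration of the difference (the schemas and field names are only indicative, not the real integration schemas):

OKTA_SCHEMA = {"event.dataset", "okta.actor.id"}
AZURE_SCHEMA = {"event.dataset", "azure.signinlogs.properties.user_id"}
MERGED = OKTA_SCHEMA | AZURE_SCHEMA

# Subquery 1 targets Okta data but mistakenly references an Azure field.
subquery_1_fields = {"event.dataset", "azure.signinlogs.properties.user_id"}

print(subquery_1_fields <= MERGED)       # True  -> merged schema misses the bug
print(subquery_1_fields <= OKTA_SCHEMA)  # False -> per-subquery schema catches it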
Checklist
Add one of the labels bug, enhancement, schema, maintenance, Rule: New, Rule: Deprecation, Rule: Tuning, Hunt: New, or Hunt: Tuning so guidelines can be generated
Add the meta:rapid-merge label if planning to merge within 24 hours