
Conversation


@Mikaayenson Mikaayenson commented Sep 3, 2025

Pull Request

Issue link(s):

Summary - What I changed

In #4688, @Samirbous is adding the first multi-dataset query to the repo. His PR leverages EQL to correlate across different data sources per subquery. This PR refactors the integration validation to support multiple data sources used within an EQL sequence query (multiple packages within a single integration, or multiple integrations).

Important

Instead of validating an entire EQL sequence query with a single merged schema, we're now validating subqueries individually with the proper schemas.

To clean up some of the # type: ignore[reportUnknownVariableType] litter, it might be good to apply @typing.no_type_check

As part of this large refactor, the major change is this: previously we had several branching conditions and multiple validation calls per query just to double-check validation. Now, for each rule, we build a validation plan by pulling all of the schemas it needs, then execute that plan.
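A minimal sketch of the plan-then-execute shape, with hypothetical names (ValidationStep, schema_for_dataset, extract_subqueries, and parse_with_schema are illustrative stand-ins, not the PR's actual code):

    # Sketch only: collect (subquery, schema) pairs up front, then run them in
    # one pass instead of branching and re-validating per query type.
    from dataclasses import dataclass

    @dataclass
    class ValidationStep:
        subquery: str           # one sequence subquery (or the whole query)
        schema: dict[str, str]  # field -> type map scoped to the subquery's dataset

    def build_plan(rule) -> list[ValidationStep]:
        # schema_for_dataset and extract_subqueries are hypothetical helpers
        return [
            ValidationStep(sub.text, schema_for_dataset(sub.dataset))
            for sub in extract_subqueries(rule.query)
        ]

    def execute_plan(plan: list[ValidationStep]) -> None:
        for step in plan:
            parse_with_schema(step.subquery, step.schema)  # raises on unknown fields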

Warning

Bumping this to a minor version, as it may break validation for users (now that we're identifying new potential errors).

How To Test

  • Added a new unit test class to validate the change, so CI should pass. I also refactored tests/test_python_library.py for more consistent test formatting
  • Ensure we're not inadvertently breaking sequence validation
  • Ensure we didn't introduce regressions in DaC auto schema generation
  • In testing I identified several rules now failing validation (often when beats schemas were added to rules but the fields were not present in those beats schemas).
Failing Rules

Each has to be manually checked: double_check_siem_rules.txt

Note

The unit tests will fail until these rules are tuned.

#5072

Additional Context

EQL’s parser accepts a single flat schema per parse. It has no concept of “schema scoped by dataset per subquery.” If you pass the whole sequence with a merged schema, you lose the ability to enforce that each subquery uses only the fields from its own integration/package.
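For illustration, here is a minimal sketch of that constraint using the open-source eql package directly (the repo wraps this API in its own helpers, so the exact Schema arguments are an assumption):

    import eql

    # One flat event-type -> field-type map per parse; there is no way to say
    # "these fields only apply when event.dataset == X".
    schema = {"process": {"name": "string", "pid": "number"}}

    with eql.Schema(schema):
        eql.parse_query('process where name == "cmd.exe"')  # ok: field is in the schema

    with eql.Schema(schema):
        eql.parse_query('process where badfield == "x"')    # raises: unknown field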

Why not validate once with a merged schema

  • Superset masking: a field from integration A will exist in the merged schema even when you’re in a subquery whose dataset belongs to integration B. The parse will succeed, and you won’t catch the misuse (see the toy example after this list).
  • Type conflicts: different packages can define the same field name with different types. A merged map picks one type arbitrarily (often last-wins), producing wrong acceptances or wrong errors.
  • Ambiguous errors: Even if you detect an error, you can’t attribute it cleanly to “subquery X vs package Y” because the validation had no subquery boundary.
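To make the masking concrete, a toy example with made-up field maps:

    # Hypothetical field maps for two integrations (illustrative only).
    integration_a = {"aws.cloudtrail.user_identity.arn": "keyword"}
    integration_b = {"okta.actor.id": "keyword"}
    merged = {**integration_a, **integration_b}

    # A subquery scoped to integration B that mistakenly uses an A-only field:
    bad_field = "aws.cloudtrail.user_identity.arn"
    print(bad_field in merged)         # True  -> merged schema accepts the misuse
    print(bad_field in integration_b)  # False -> per-dataset schema catches it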

Why per-subquery validation is necessary

  • Flat-schema constraint: EQL validates against one field-type map at a time. To emulate “dataset scoping,” we parse each subquery with only the fields from the dataset’s integration (plus ECS/index/custom schemas as needed); see the sketch after this list.
  • Correctness by construction: if a subquery references a field from another package, it won’t be present in that subquery’s schema, and the parser raises “Unknown field” (or “Field not recognized” with the proper trailer).
  • Clear attribution: You get an error bound to the specific subquery and its intended package, which is actionable.
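Putting it together, a rough sketch of the per-subquery pass (schema_for_dataset is a hypothetical helper, and the naive event-type split stands in for the real AST handling):

    import eql

    def validate_subquery(dataset: str, subquery_text: str) -> str | None:
        """Parse one subquery (e.g. 'network where okta.actor.id != null')
        against only its own dataset's field map."""
        fields = schema_for_dataset(dataset)   # hypothetical: dataset -> {field: type}
        event_type = subquery_text.split()[0]  # naive; real code walks the AST
        try:
            with eql.Schema({event_type: fields}):
                eql.parse_query(subquery_text)
        except eql.EqlError as exc:
            return f"[{dataset}] {exc}"  # error bound to this subquery and its package
        return None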

Checklist

  • Added a label for the type of PR: bug, enhancement, schema, maintenance, Rule: New, Rule: Deprecation, Rule: Tuning, Hunt: New, or Hunt: Tuning so guidelines can be generated
  • Added the meta:rapid-merge label if planning to merge within 24 hours
  • Secret and sensitive material has been managed correctly
  • Automated testing was updated or added to match the most common scenarios
  • Documentation and comments were added for features that require explanation

@Mikaayenson Mikaayenson self-assigned this Sep 3, 2025
@Mikaayenson Mikaayenson added the enhancement (New feature or request), test-suite (unit and other testing components), and python (Internal python for the repository) labels Sep 3, 2025

github-actions bot commented Sep 3, 2025

Enhancement - Guidelines

These guidelines serve as a reminder set of considerations when addressing adding a feature to the code.

Documentation and Context

  • Describe the feature enhancement in detail (alternative solutions, description of the solution, etc.) if not already documented in an issue.
  • Include additional context or screenshots.
  • Ensure the enhancement includes necessary updates to the documentation and versioning.

Code Standards and Practices

  • Code follows established design patterns within the repo and avoids duplication.
  • Ensure that the code is modular and reusable where applicable.

Testing

  • New unit tests have been added to cover the enhancement.
  • Existing unit tests have been updated to reflect the changes.
  • Provide evidence of testing and validating the enhancement (e.g., test logs, screenshots).
  • Validate that any rules affected by the enhancement are correctly updated.
  • Ensure that performance is not negatively impacted by the changes.
  • Verify that any release artifacts are properly generated and tested.
  • Conducted system testing, including fleet, import, and create APIs (e.g., run make test-cli, make test-remote-cli, make test-hunting-cli)

Additional Checks

  • Verify that the enhancement works across all relevant environments (e.g., different OS versions).
  • Confirm that the proper version label is applied to the PR: patch, minor, or major.

@Mikaayenson Mikaayenson marked this pull request as draft September 3, 2025 21:13
@Mikaayenson Mikaayenson marked this pull request as ready for review September 3, 2025 21:41

terrancedejesus commented Sep 4, 2025

@Mikaayenson have we tried a rule or query that uses truly separate data sources (separate integrations)? Like Okta and Azure Activity logs? The rule mentioned is the Azure integration, but with Entra ID Protection logs and Entra ID Audit logs as separate data streams. Similar to how we correlate Entra ID sign-ins to Microsoft Graph activity here, but it's an ES|QL rule. The closest I believe to truly separate data sources is this Okta rule, which looks at Okta system logs and any logs reported by a Windows endpoint, but it does not use event.dataset, and thus we did not run into this support issue.


Mikaayenson commented Sep 4, 2025

> @Mikaayenson have we tried a rule or query that uses truly separate data sources (separate integrations)? […]

@terrancedejesus Did you see the unit tests?

@terrancedejesus

> @terrancedejesus Did you see the unit tests?

rgr, thanks for sharing. From the testing, I see we cover the following:

  • 1 integration : 2+ data streams
  • 2 integrations : 2+ data streams

That covers my question. Thank you!

@Mikaayenson Mikaayenson marked this pull request as draft September 5, 2025 03:47
@Mikaayenson Mikaayenson changed the title [FR] Support Multi-Dataset Sequence Validation [FR] Refactor Schema Validation & Support Multi-Dataset Sequence Validation Sep 6, 2025
build_rule(query, "kuery")

@unittest.skip("Redundant with new validation?")

I suspect you are correct, will double check.

@Mikaayenson Mikaayenson marked this pull request as ready for review September 6, 2025 09:32
Comment on lines +361 to +363
# Wrap the subquery in a minimal two-branch sequence so the parser still
# validates the join ("by") fields against the schema.
join_fields = ", ".join(map(str, getattr(subquery, "join_values", []) or []))
dummy_by = f" by {join_fields}" if join_fields else ""
return f"sequence\n {subquery_text}\n [any where true]{dummy_by}"

@Mikaayenson Mikaayenson Sep 8, 2025


I initially thought about just appending runs=2 instead, but then we lose validation on the join-by fields.

        return f"sequence\n  {subquery_text} with runs=2"

Labels
backport: auto, enhancement (New feature or request), minor, python (Internal python for the repository), schema, test-suite (unit and other testing components)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Integration Validation Missing Dataset Specific Schemas
[Bug] EQL Sequence Multi-Data Source Schema Validation
5 participants