Skip to content

Support approximate related locations #19943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 11, 2025

Conversation

asgerf
Copy link
Contributor

@asgerf asgerf commented Jul 1, 2025

Some queries could not be made diff-informed because they select an entity whose location is not a Location. Specifically this is true for the RegExpTerms inside a regular expression.

This PR addresses this by adding the ability to filter based on a location containing the final selected location. We then filter based on the location of the entire RegExp literal, which is valid because it encloses the location of all its terms.

Unfortunately the Ruby query still can't be made diff-informed because the locations of its RegExpTerms aren't correct when the regexp is parsed from a string arising from constant folding. But it fixes it for Python and Java.

Initially in this PR the approximate location filtering was opt-in via a separate API, but after discussing with @jbj it is then made the default in a later commit.

@asgerf asgerf force-pushed the approximate-related-location branch from 75b4c30 to 4a2d795 Compare July 2, 2025 12:42
@jbj
Copy link
Contributor

jbj commented Jul 7, 2025

The code LGTM, and DCA looks good as far as I can tell. I inspected join orders manually by editing the following into default-alert-filter.yml and running java/polynomial-redos on a MRVA-obtained snapshot of apache/hadoop:

      - ["/home/runner/work/bulk-builder/bulk-builder/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPFileSystem.java", 296, 296]
      - ["/home/runner/work/bulk-builder/bulk-builder/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPFileSystem.java", 10000, 109999]

This alert filter marks line 296, which contains a source for the query, and 100,000 unrelated (non-existent) lines just to see some big tuple counts. Note: the line number 296 may be out of date.

We get the following tuple counts for isFilteredSource (the caller of the inline predicate filterByLocation):

[2025-07-07 08:38:23] Evaluated non-recursive predicate PolynomialReDoSQuery::PolynomialRedosFlow::Stage1::SourceSinkFiltering::isFilteredSource/1#40b89ec6@17525co4 in 8ms (size: 1).
Evaluated relational algebra for predicate PolynomialReDoSQuery::PolynomialRedosFlow::Stage1::SourceSinkFiltering::isFilteredSource/1#40b89ec6@17525co4 with tuple counts:
          2029   ~0%    {1} r1 = JOIN `DataFlowImpl::MakeImpl<Location::Location,DataFlowImplSpecific::JavaDataFlow>::DefaultState<PolynomialReDoSQuery::PolynomialRedosConfig>::isSource/2#7cf36af0#cpe#1` WITH `DataFlowUtil::Node.getLocation/0#dispred#eae77cf5` ON FIRST 1 OUTPUT Lhs.0
             0   ~0%    {1}    | AND NOT restrictAlertsTo#nonempty(FIRST 0)
             0   ~0%    {1}    | AND NOT restrictAlertsToExactLocation#nonempty(FIRST 0)
                    
          2029   ~0%    {2} r2 = JOIN `DataFlowImpl::MakeImpl<Location::Location,DataFlowImplSpecific::JavaDataFlow>::DefaultState<PolynomialReDoSQuery::PolynomialRedosConfig>::isSource/2#7cf36af0#cpe#1` WITH `DataFlowUtil::Node.getLocation/0#dispred#eae77cf5` ON FIRST 1 OUTPUT Rhs.1, Lhs.0
                    
          2029   ~4%    {2} r3 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Rhs.1, Lhs.1
             0   ~0%    {1}    | JOIN WITH `AlertFiltering::AlertFilteringImpl<Location::Location>::restrictAlertsToEntireFile/1#c1a92d11` ON FIRST 1 OUTPUT Lhs.1
                    
          2029   ~0%    {4} r4 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Rhs.2, Rhs.4
        300003   ~0%    {4}    | JOIN WITH `AlertFiltering::AlertFilteringImpl<Location::Location>::restrictAlertsToStartLine/2#a32b1746` ON FIRST 1 OUTPUT Lhs.1, Lhs.2, Lhs.3, Rhs.1
             1   ~0%    {4}    | JOIN WITH PRIMITIVE range#bbb ON Lhs.1,Lhs.2,Lhs.3
             1   ~0%    {1}    | SCAN OUTPUT In.0
                    
          2029   ~5%    {6} r5 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Rhs.2, Rhs.3, Rhs.4, Rhs.5
             0   ~0%    {9}    | JOIN WITH restrictAlertsToExactLocation ON FIRST 1 OUTPUT Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5, Rhs.1, Rhs.2, Rhs.3, Rhs.4
             0   ~0%    {9}    | REWRITE WITH NOT [TEST InOut.1 = InOut.5, NOT [TEST InOut.2 <= InOut.6]]
             0   ~0%    {9}    | REWRITE WITH NOT [TEST InOut.3 = InOut.7, NOT [TEST InOut.4 >= InOut.8]]
             0   ~0%    {9}    | REWRITE WITH TEST InOut.1 <= InOut.5
                        {9}    | REWRITE WITH TEST InOut.3 >= InOut.7
             0   ~0%    {1}    | SCAN OUTPUT In.0
                    
             1   ~0%    {1} r6 = r1 UNION r3 UNION r4 UNION r5
                        return r6

There is a slight problem where we get 300,003 tuples, which is the number of sources in the file times the number of lines in the diff range in that file. To avoid that risk of blow-up, I propose calling restrictAlertsToStartLine like this instead:

restrictAlertsToStartLine(pragma[only_bind_into](filePath), [locStartLine .. locEndLine])

Now the join order becomes more sensible, with the tuple numbers increasing only slightly to allow for multi-line sources:

[2025-07-07 08:41:52] Evaluated non-recursive predicate PolynomialReDoSQuery::PolynomialRedosFlow::Stage1::SourceSinkFiltering::isFilteredSource/1#40b89ec6@30f329uu in 3ms (size: 1).
Evaluated relational algebra for predicate PolynomialReDoSQuery::PolynomialRedosFlow::Stage1::SourceSinkFiltering::isFilteredSource/1#40b89ec6@30f329uu with tuple counts:
        2029   ~0%    {1} r1 = JOIN `DataFlowImpl::MakeImpl<Location::Location,DataFlowImplSpecific::JavaDataFlow>::DefaultState<PolynomialReDoSQuery::PolynomialRedosConfig>::isSource/2#7cf36af0#cpe#1` WITH `DataFlowUtil::Node.getLocation/0#dispred#eae77cf5` ON FIRST 1 OUTPUT Lhs.0
           0   ~0%    {1}    | AND NOT restrictAlertsTo#nonempty(FIRST 0)
           0   ~0%    {1}    | AND NOT restrictAlertsToExactLocation#nonempty(FIRST 0)
                  
        2029   ~0%    {2} r2 = JOIN `DataFlowImpl::MakeImpl<Location::Location,DataFlowImplSpecific::JavaDataFlow>::DefaultState<PolynomialReDoSQuery::PolynomialRedosConfig>::isSource/2#7cf36af0#cpe#1` WITH `DataFlowUtil::Node.getLocation/0#dispred#eae77cf5` ON FIRST 1 OUTPUT Rhs.1, Lhs.0
                  
        2029   ~4%    {2} r3 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Rhs.1, Lhs.1
           0   ~0%    {1}    | JOIN WITH `AlertFiltering::AlertFilteringImpl<Location::Location>::restrictAlertsToEntireFile/1#c1a92d11` ON FIRST 1 OUTPUT Lhs.1
                  
        2029   ~1%    {4} r4 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Lhs.1, Rhs.1, Rhs.2, Rhs.4
        2231   ~0%    {5}    | JOIN WITH PRIMITIVE range#bbf ON Lhs.2,Lhs.3
        2231   ~1%    {3}    | SCAN OUTPUT In.1, In.4, In.0
           1   ~0%    {1}    | JOIN WITH `AlertFiltering::AlertFilteringImpl<Location::Location>::restrictAlertsToStartLine/2#a32b1746` ON FIRST 2 OUTPUT Lhs.2
                  
        2029   ~5%    {6} r5 = JOIN r2 WITH `Location::Location.hasLocationInfo/5#dispred#c91623d0` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Rhs.2, Rhs.3, Rhs.4, Rhs.5
           0   ~0%    {9}    | JOIN WITH restrictAlertsToExactLocation ON FIRST 1 OUTPUT Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5, Rhs.1, Rhs.2, Rhs.3, Rhs.4
           0   ~0%    {9}    | REWRITE WITH NOT [TEST InOut.1 = InOut.5, NOT [TEST InOut.2 <= InOut.6]]
           0   ~0%    {9}    | REWRITE WITH NOT [TEST InOut.3 = InOut.7, NOT [TEST InOut.4 >= InOut.8]]
           0   ~0%    {9}    | REWRITE WITH TEST InOut.1 <= InOut.5
                      {9}    | REWRITE WITH TEST InOut.3 >= InOut.7
           0   ~0%    {1}    | SCAN OUTPUT In.0
                  
           1   ~0%    {1} r6 = r1 UNION r3 UNION r4 UNION r5
                      return r6

private predicate restrictAlertsToEntireFile(string filePath) { restrictAlertsTo(filePath, 0, 0) }

pragma[nomagic]
private predicate restrictAlertsToStartLine(string filePath, int line) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Despite the name, this predicate does not seem to treat the start line in any special way. Perhaps it can be renamed to restrictAlertsToLine?

jbj added 3 commits July 9, 2025 09:24
It's better to join with the range expression first since that will only
multiply tuple counts by the number of lines in an average source/sink.
Joining with `restrictAlertsToStartLine` first would multiply tuple
counts by the number of sources/sinks in a given file.
@jbj jbj marked this pull request as ready for review July 9, 2025 08:11
@jbj jbj requested review from a team as code owners July 9, 2025 08:11
@jbj jbj added the no-change-note-required This PR does not need a change note label Jul 9, 2025
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following QL doc for restrictAlertsTo needs to be updated:

and its start line is between `startLineStart` and `startLineEnd`, inclusive. (Note that only start line of the location is used for matching because an alert is displayed on the first line of its location.)

E.g. to

and its line span is between `startLineStart` and `startLineEnd`, inclusive.

The documentation is now up-to-date with the new and more relaxed rules
that allow overapproximating the results. I have also attempted to make
a clearer distinction between the requirements of the specification and
the behaviour of the implementation.
@jbj
Copy link
Contributor

jbj commented Jul 9, 2025

The following QL doc for restrictAlertsTo needs to be updated

I can see the docs were confusing because they had not been kept up to date with the evolution of the requirements. I've now rearranged them and tried to distinguish clearly between the spec of diff-informed queries in restrictAlertsTo, which only requires looking at the start line, and the implementation in filterByLocation, which now looks at the end line too.

* excluded from the query results if only if (1) there is at least one active filtering predicate,
* and (2) it is not accepted by any active filtering predicate.
* effect. If it is active, queries may omit alerts that don't have a _primary_ or _related_
* location (in SARIF terminology) whose start line is within a specified range. Queries are allowed
Copy link
Contributor

@hvitved hvitved Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whose start line is within a specified range

But doesn't the new implementation change this to any line spanned by the location?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The new implementation includes more alerts than strictly necessary, but that's allowed. We include these extra alerts in order to support regex queries in a simple way with the Location-based API we have now even though there's no Location object that directly corresponds to the alert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was: shouldn't the QL doc be updated to reflect this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc on restrictAlertsTo reflects the contract between the CodeQL queries and the CodeQL action. That contract hasn't changed, except that I've now updated the doc to reflect the contract changes that were made a while ago (that overapproximations are acceptable). This contract is in terms of file names and line numbers. The changes in this PR are about how we relate Location objects to file names and line numbers in the filterByLocation helper predicate.

Queries don't need to use filterByLocation. If a future query uses restrictAlertsTo directly, the QLDoc on that predicate should still describe the precise contract so the query has freedom to overapproximate it or not in the most efficient manner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification.

@jbj jbj merged commit 76544f2 into github:main Jul 11, 2025
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java no-change-note-required This PR does not need a change note Python Ruby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants