Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detect repeated STL container lookups #123376

Open
firewave opened this issue Jan 17, 2025 · 23 comments
Open

detect repeated STL container lookups #123376

firewave opened this issue Jan 17, 2025 · 23 comments
Labels
check-request Request for a new check in clang-tidy clang:static analyzer

Comments

@firewave
Copy link

@kazutakahirata has been fixing a lot of duplicated lookups with STL containers in recent months - like
c531255
94e9813

I do not know how these were found but it would be great if clang-tidy could detect these.

See the history for way more examples: https://github.com/llvm/llvm-project/commits?author=kazutakahirata.

@firewave firewave added check-request Request for a new check in clang-tidy clang-tidy labels Jan 17, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 17, 2025

@llvm/issue-subscribers-clang-tidy

Author: Oliver Stöneberg (firewave)

@kazutakahirata has been fixing a lot of duplicated lookups with STL containers in recent months - like c531255 94e9813

I do not know how these were found but it would be great if clang-tidy could detect these.

See the history for way more examples: https://github.com/llvm/llvm-project/commits?author=kazutakahirata.

@nicovank
Copy link
Contributor

I think this is a job for the static analyzer. It can keep track of the last lookup for each map, and on lookup check if the saved value equals the new value. I wrote something similar in Infer for a different purpose (removing some false positives).

I actually had started writing a CSA MapDoubleLookup version of such a check a few months ago but I'm not familiar with it so it took time. Is this something that you can see being added to the analyzer as a new check @steakhal @NagyDonat @Xazax-hun @haoNoQ?

@kazutakahirata
Copy link
Contributor

Yeah, automated discovery would be nice. I'm using clang-query to find Map[Item] and then look for find, count, and contains a few lines prior to Map[Item], so the process is half automated.

There are quite a few variations, but here are some common ones:

if (Map.contains(Item))  // or count
  // Read from Map[Item];

"then" and "else" may be flipped:

if (!Map.contains(Item))
  break; // or continue or return.
// Read from Map[Item].

Checking for membership in advance:

if (!Map.contains(Item))
  Map[Item] = 123;

One challenge is to examine the side effects between the membership check and insertion:

if (!Map.contains(Item)) {
  auto Val = foo();
  Map[Item] = Val;
}

Here, foo() might rely on the fact that Map does not contain Item yet. Inserting an element ahead of time in the "if" condition may change the semantics of the original program.

@nicovank
Copy link
Contributor

Here, foo() might rely on the fact that Map does not contain Item yet.

This is why I think inter-procedural analysis is necessary. Plus all the logic regarding methods invalidating references would probably be easier to implement.

@llvmbot
Copy link
Member

llvmbot commented Jan 19, 2025

@llvm/issue-subscribers-clang-static-analyzer

Author: Oliver Stöneberg (firewave)

@kazutakahirata has been fixing a lot of duplicated lookups with STL containers in recent months - like c531255 94e9813

I do not know how these were found but it would be great if clang-tidy could detect these.

See the history for way more examples: https://github.com/llvm/llvm-project/commits?author=kazutakahirata.

@steakhal
Copy link
Contributor

Here, foo() might rely on the fact that Map does not contain Item yet.

This is why I think inter-procedural analysis is necessary. Plus all the logic regarding methods invalidating references would probably be easier to implement.

Uh, that should be possible to do, but in general really hard to achieve.
Tracking how and where objects escape is a pain in the neck as soon as you start reasoning about globals and pointers.
Like with everything, this should be implemented in baby-steps, and refine the implementation and take shortcuts. Otherwise, if we play by the book, it's not going to be a useful checker.

That said, I hacked together a checker that should work in general - although I don't reason about the number of contained elements in a container.

My idea was to look for set/map "lookups", while looking at who could access and mutate the container between the "lookups". If it was mutated or escaped, then conservatively assume that the second "lookup" makes sense and not warn.

You can find my alpha checker at my branch.
Remember, this is not production ready, likely has a bunch of bugs and limitations. Maybe crashes as well.
Anyone can use it, just let me know you use it and give attribution accordingly.

I don't think I'll have the time to ever come back to this checker, so anyone willing to push it forward and publish it as a PR is more than welcomed.


Have you used weggli for coming up with a query to find refactoring opportunities?
It appears to my a bit more lightweight than writing AST matchers and clang-query. WDYT @kazutakahirata

@Xazax-hun
Copy link
Collaborator

My idea was to look for set/map "lookups", while looking at who could access and mutate the container between the "lookups". If it was mutated or escaped, then conservatively assume that the second "lookup" makes sense and not warn.

Absolutely, this is the biggest challenge. It is very easy to run into iterator invalidation or similar errors when the map ends up being mutated between the lookups.

On the other hand, if Clang ever adopts something akin to a borrow checker and that information is available to tidy or the static analyzer, for lifetime annotated code this question will be trivial to answer. This direction is not really relevant now but wanted to mention it just in case it happens in the next couple of years and someone takes a stab at this problem after that.

@steakhal
Copy link
Contributor

Now that I think about it, in my prototype, I should have restricted the checker to only consider double lookups if they are present in the same function body (aka. LocarionContext).

@nicovank
Copy link
Contributor

This is what I was playing around with a couple months ago: MapDoubleLookup.cpp. It was specialized to maps and handling the different methods one-by-one. I might get back to it but not in the immediate future and take things from / build upon your draft.

I should have restricted the checker to only consider double lookups if they are present in the same function body (aka. LocarionContext).

That's one thing I wanted to do but couldn't figure out how (first time touching CSA).

@firewave
Copy link
Author

@kazutakahirata I was wondering if the query logic could possibly "just" be encapsulated in code to get this somehow started.

And then suddenly this appeared: #123734 (I am always amazed by such coincidences).

@kazutakahirata
Copy link
Contributor

Have you used weggli for coming up with a query to find refactoring opportunities? It appears to my a bit more lightweight than writing AST matchers and clang-query. WDYT @kazutakahirata

No, I've never heard of it, but I just tried it. It's super fast! I'll keep it in my tool box. Thanks!

To find repeated map lookups, however, I do care about the type of Map in Map[Item]. I'm not sure weggli can accommodate that.

@kazutakahirata
Copy link
Contributor

@kazutakahirata I was wondering if the query logic could possibly "just" be encapsulated in code to get this somehow started.

I'm happy to share my clang-query script below, but it just finds Map[Item]. It doesn't look for a duplicate lookup on its own.

let hasMapName hasAnyName("::std::map", "::std::unordered_map", "::llvm::DenseMap", "::llvm::SmallDenseMap", "::llvm::MapVector", "::llvm::SmallMapVector", "::llvm::StringMap")

let hasTypeMap ignoringImpCasts(hasType(hasUnqualifiedDesugaredType(recordType(hasDeclaration(classTemplateSpecializationDecl(hasMapName))))))

match cxxOperatorCallExpr(hasOverloadedOperatorName("[]"),hasArgument(0,hasTypeMap))

@steakhal
Copy link
Contributor

This is what I was playing around with a couple months ago: MapDoubleLookup.cpp. It was specialized to maps and handling the different methods one-by-one. I might get back to it but not in the immediate future and take things from / build upon your draft.

I should have restricted the checker to only consider double lookups if they are present in the same function body (aka. LocarionContext).

That's one thing I wanted to do but couldn't figure out how (first time touching CSA).

I've just pushed a commit restricting the issues only for a single stack frame. So we shouldn't have redundant lookup reports from different stack frames.

@steakhal
Copy link
Contributor

I gave a bit of a polish to my alpha.cplusplus.RedundantLookupChecker checker and gave it a test run on the StaticAnalyzer sources.
Interestingly, I found a single TP:

diff --git a/clang/lib/StaticAnalyzer/Core/ExplodedGraph.cpp b/clang/lib/StaticAnalyzer/Core/ExplodedGraph.cpp
--- a/clang/lib/StaticAnalyzer/Core/ExplodedGraph.cpp
+++ b/clang/lib/StaticAnalyzer/Core/ExplodedGraph.cpp
@@ -488,15 +488,17 @@ ExplodedGraph::trim(ArrayRef<const NodeTy *> Sinks,
   while (!WL2.empty()) {
     const ExplodedNode *N = WL2.pop_back_val();
 
+    auto [Place, Inserted] = Pass2.try_emplace(N);
+
     // Skip this node if we have already processed it.
-    if (Pass2.contains(N))
+    if (!Inserted)
       continue;
 
     // Create the corresponding node in the new graph and record the mapping
     // from the old node to the new node.
     ExplodedNode *NewN = G->createUncachedNode(N->getLocation(), N->State,
                                                N->getID(), N->isSink());
-    Pass2[N] = NewN;
+    Place->second = NewN;
 
     // Also record the reverse mapping from the new node to the old node.
     if (InverseMap) (*InverseMap)[NewN] = N;

It takes ages to run it on my personal computer, so anyone interested could give it a try.
Basically, what I did was, I built my custom clang, made sure it's present in my PATH, and then used CodeChecker like this:

CodeChecker analyze --analyzers clangsa --enable=core --enable alpha.cplusplus.RedundantLookupChecker compile_commands.json -o result
CodeChecker parse result

Given this outcome so far, I'd be interested in running this on the whole clang/llvm.

@steakhal

This comment has been minimized.

@firewave
Copy link
Author

It would probably be interesting to run it on a version of the code before Kazu started with his cleanups. I think they started around September 3 2024.

It takes ages to run it on my personal computer, so anyone interested could give it a try.

Depending on the code CSA can be really slow. For Cppcheck it takes about 2.5x the time the clang-tidy checks need (and the latter might still have one or two hot spots as well).

steakhal added a commit that referenced this issue Jan 30, 2025
I found this using my experimental checker present at:
https://github.com/steakhal/llvm-project/tree/bb/add-redundant-lookup-checker

The idea for looking for redundant container lookups was inspired by
#123376

If there is interest, I could think of upstreaming this alpha checker.
(For the StaticAnalyzer sources it was the only TP, and I had no FPs
from the checker btw.)
@steakhal
Copy link
Contributor

It would probably be interesting to run it on a version of the code before Kazu started with his cleanups. I think they started around September 3 2024.

Thanks for the suggestion, I was also thinking about it but you convinced me to actually do it.

I reverted the following 19 commits, and then created a small compilation_database.json with their build commands to not waste compute resources analyzing uninteresting files.
Commits: 94e9813, 8baa0d9, 1d5ce61, 817e777, 850852e, 0cc74a8, 72918fd, 19a7fe0, 24892b8, 3d15bfb, d5ef2c0, 09bf5b0, c531255, ccc066e, 5d2393a, 8ad4f1a, bb019dd, 6a5a795, fa9fb2a

Then, I ran my checker using CodeChecker, as I explained earlier, and looked at the findings:

  • The double lookups were found in the wast majority of the cases by my checker, and usually many more findings that were not fixed yet. (This totally makes sense as usually a commit only fixed 1 case, so they are not per file granularity).

  • I haven't seen any FPs of my checker. Which is great news.

  • There were patterns that my checker deliberately ignored, for instance if a lookup key is the result of a function. Then I assume that the function call may have sideffects, thus the double lookup is intentional. I think I should revisit this to achieve wider coverage (aka. more reports, thus more TPs).

  • I've seen only a handful of cases that I though my checker should catch, yet it missed. I'll have a look at those and figure out what is missing.

  • The checker reports use the usual path-sensitive reporting, which usually makes the reports quite lengthy without delivering much value. I'm considering switching to old-school basic diagnostic pieces to only mention parts between the first lookup and the second lookup.

Overall, I'm content with the checker.

@steakhal
Copy link
Contributor

I've pushed my final changes to the checker. Now, it stringifies the lookup "key" to see if it's any different across the two lookups. Not pretty, but gets the job done for now. I also checked the reporting, to only emit the two notes for the two lookups if finds.

I'm convinced that the Static Analyzer is not the right tool for catching these bugs, as in my implementation I gradually relaxed basically everything that would it tie to the symbolic execution engine.

Consequently, it would be likely easier to implement the same using a well crafted AST-based checker in clang-tidy. That would also solve the performance problems by not paying for what you don't use - unlike using a CSA checker, where you need to do the full-blown exploded graph exploration to then later just discard the benefits it would bring - like in my checker.

So, all in all, I'm happy with my checker (finds TPs, without any FPs), but it's really heavy weight, and this should be done in clang-tidy.

@steakhal
Copy link
Contributor

So, all in all, I'm happy with my checker (finds TPs, without any FPs), but it's really heavy weight, and this should be done in clang-tidy.

Frankly, I couldn't hold myself back. :D
Of course, I implemented this in clang-tidy too, using a two-phase approach:
Match all "lookup" expressions ("count", "contains", "operator[]" of some object that has "map|set" in their name) per function and container object.
Later, once we finished with the TU, take the groups that have at least 2 lookups of the same container object within a function and report the first one while noting the rest of the lookup places:
Image

In theory, this could have FPs due to control-flow or mutation to the container between the lookups, but in practice that doesn't seem to happen too often once I looked at some of the reports my checker produced.
I've seen so far such a FP once.

I ignore "lookups" within macros, for example assert(map.count(key)]).

I did the validation again on the same set of commits, and my checker could find all of them and more!
I run my checker on clang itself, and I found roughly 9000 hits. (I'll try to host these at some place at some point)

I pushed my code to the same branch.
@kazutakahirata WDYT? Are you interested in the clang-tidy check, would it be useful?

@kazutakahirata
Copy link
Contributor

I pushed my code to the same branch. @kazutakahirata WDYT? Are you interested in the clang-tidy check, would it be useful?

Wow! Thank you so much for all of your efforts into this! Yes, I'm interested in the clang-tidy check. Dramatically narrowing down the scope is very useful in general (that is, all the functions with at least two lookups). I'm not worried too much about false positives. While I haven't tried your code, I am wondering if it's helpful to be able to specify the maximum distance between two lookups in terms of line numbers (say, at most 10 lines apart or something).

Let me take some time to catch up on this issue. Thanks again!

@firewave
Copy link
Author

firewave commented Feb 1, 2025

I think it might sense to publish this as a PR to get more eyes on it and gather more feedback.

@EugeneZelenko
Copy link
Contributor

I think will be good idea to add next pattern, which I observed on code base at my work:

if (container.find(key) == container.end())
    container.insert(key)

or

if (container.find(key) == container.end())
{
    ...
    container.insert(key)
    ...
}

Same for map-like containers with value.

@steakhal
Copy link
Contributor

steakhal commented Feb 2, 2025

I think it might sense to publish this as a PR to get more eyes on it and gather more feedback.

Posted the PR at #125420. Follow the technical discussion there about the check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
check-request Request for a new check in clang-tidy clang:static analyzer
Projects
None yet
Development

No branches or pull requests

7 participants