Detecting and Measuring Coupling #5211

Open
jflo opened this issue Mar 13, 2023 · 0 comments

jflo commented Mar 13, 2023

What are we trying to do: Modularity requires decoupling. Decoupling is a massive subject, so how do we know where we will get the most value for our time spent?

We tried measuring coupling directly with Sonar, but custom rules would be required. Sonar has metrics for complexity, but the rules governing coupling were removed: https://community.sonarsource.com/t/how-to-measure-class-coupling-metric-with-sonarqube/2801/6

IntelliJ Ultimate edition provides tooling for Data Flow Analysis. This is promising, but it would require the complete source code for every path we are interested in to be loaded into the IDE. It is also limited to tracking the flow of exact objects and primitives: it cannot tell that a new object took a traced object as input and then continue tracking the flow through that new object. Following data through derived values in this way is called "taint analysis", and it is probably going to be critical to analyzing Besu, since so much of our code is immutable. IntelliJ and Sonar do have a way to provide taint analysis, but it is designed to help find security vulnerabilities, and it doesn't seem to allow the flexibility we need to use the technique for refactoring triage instead of searching for exploits. For example, a security-focused taint analyzer would limit us to studying flow from "user supplied input", since it looks for things like SQL injection and XSS attacks. We couldn't ask questions like "when a new configuration option is introduced, how will it need to flow to get to where it is used?"
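To make the difference between exact-object tracking and taint tracking concrete, here is a minimal runtime analogy in Java. It is only a sketch: real analyzers work statically on code rather than on live objects, and `Amount`, `plusExact`, and `plusTainted` are hypothetical names made up for illustration, not Besu, IntelliJ, or CodeQL APIs.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class TaintSketch {
    /** Hypothetical immutable value type, standing in for something like a Wei amount. */
    static final class Amount {
        final long value;
        Amount(long value) { this.value = value; }
    }

    /** Objects currently being followed, compared by identity rather than equals(). */
    static final Set<Object> tracked = Collections.newSetFromMap(new IdentityHashMap<>());

    /** Exact-object tracking: the result is a brand new object, so the trail ends here. */
    static Amount plusExact(Amount a, Amount b) {
        return new Amount(a.value + b.value);
    }

    /** Taint-style tracking: if any input is tracked, the derived object is tracked too. */
    static Amount plusTainted(Amount a, Amount b) {
        Amount result = new Amount(a.value + b.value);
        if (tracked.contains(a) || tracked.contains(b)) {
            tracked.add(result);
        }
        return result;
    }

    public static void main(String[] args) {
        Amount configured = new Amount(100);
        tracked.add(configured); // start following this value

        Amount viaExact = plusExact(configured, new Amount(1));
        Amount viaTaint = plusTainted(configured, new Amount(1));

        System.out.println("exact tracking follows derived value: " + tracked.contains(viaExact));
        System.out.println("taint tracking follows derived value: " + tracked.contains(viaTaint));
    }
}
```

With immutable types, almost every operation behaves like `plusExact`, which is why tracking only exact objects loses the trail so quickly.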

CodeQL is the current favorite tool to look at, since it has explicit support for taint tracking via its DataFlow APIs. In general use, devs write queries in CodeQL, which run against a generated code database. GitHub already generates these databases, and they can be accessed directly from Visual Studio Code. VS Code is the supported tool for developing in CodeQL, so devs will have to become comfortable with it. Custom CodeQL queries can be checked in to the Besu repo and treated as "problem scans" if desired, and GitHub Actions could annotate pull requests with the problems found.

CodeQL can be thought of as two different APIs: an Abstract Syntax Tree API and a DataFlow API. The AST API is language-specific, and understands the building blocks of static source code. It is easy to understand, and the tooling comes with a viewer that lets you explore existing code to see how it fits into the Java-specific CodeQL model.

The DataFlow API is very different, and often incongruous with the AST API. It is used to chart the paths that data takes as it flows through the code, and it has support for taint tracking. This support is limited by how broad your CodeQL database is: if you have not instrumented dependent libraries, there is no way to track potential taint of inputs done by that code. This is a major problem for Besu in particular, which often extends classes provided by Apache Tuweni, or uses other cryptography libraries. Developers need a way to say "trust me, that is tainted" and move on with their analysis. I have been able to do this in some cases, but not all: classes which extend Tuweni types like Wei were particularly hard to track taint through. I suspect this is just a misunderstanding on my part of how the disparate AST and DataFlow models interoperate, and I have reached out to other CodeQL users for understanding.
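The "trust me, that is tainted" idea can be pictured as adding a hand-declared edge to a flow graph whose library-internal edges are missing. The Java sketch below is a toy model of that idea only: it is not CodeQL, and the node names (`configOption`, `libraryInput`, `libraryOutput`, `usageSite`) are hypothetical placeholders.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FlowGraphSketch {
    /** Known flow steps: data at the key can move to any node in the value set. */
    private final Map<String, Set<String>> edges = new HashMap<>();

    /** Record that data can flow from one node to another. */
    void addStep(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    /** Breadth-first search: can data starting at source reach sink? */
    boolean reaches(String source, String sink) {
        Deque<String> queue = new ArrayDeque<>(List.of(source));
        Set<String> seen = new HashSet<>(List.of(source));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(sink)) {
                return true;
            }
            for (String next : edges.getOrDefault(node, Set.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FlowGraphSketch graph = new FlowGraphSketch();

        // Steps the analyzer can see because our own source is in the database.
        graph.addStep("configOption", "libraryInput");
        graph.addStep("libraryOutput", "usageSite");

        // The step inside the library (libraryInput -> libraryOutput) is unknown,
        // so the path is broken.
        System.out.println("before manual step: " + graph.reaches("configOption", "usageSite"));

        // "Trust me, that is tainted": declare the missing step by hand.
        graph.addStep("libraryInput", "libraryOutput");
        System.out.println("after manual step: " + graph.reaches("configOption", "usageSite"));
    }
}
```

The open question in this issue is how to express that manual step reliably in CodeQL for types that extend library classes such as Wei.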

At the suggestion of EF security staff, I'm also adding Semgrep to the list of potential tools that can help figure this out. They have a whole section on taint analysis, but I have not tried it yet. The description at https://semgrep.dev/docs/writing-rules/data-flow/taint-mode/ sounds promising.

@macfarla macfarla added the TeamChupa GH issues worked on by Chupacabara Team label Mar 13, 2023
@jflo jflo removed the TeamChupa GH issues worked on by Chupacabara Team label Oct 15, 2024