What are we trying to do: Modularity requires decoupling. Decoupling is a massive subject, so how do we know where we will get the most value for our time spent?
We tried measuring coupling directly with Sonar, but custom rules would be required. It has metrics for complexity; however, the rules governing coupling were removed: https://community.sonarsource.com/t/how-to-measure-class-coupling-metric-with-sonarqube/2801/6
IntelliJ Ultimate edition provides tooling for Data Flow Analysis. This is promising, but it would require the complete source code for every path we are interested in to be loaded into the IDE. It is also limited to tracking the flow of exact objects and primitives: it cannot tell that a new object took a traced object as input and then continue tracking the flow through that new object. Following flow through derived objects in this way is called "taint analysis", and it is probably going to be critical to analyzing Besu, since so much of our code is immutable. IntelliJ and Sonar do offer taint analysis, but it is designed to find security vulnerabilities, and it doesn't seem to allow the flexibility we need if we want to use the technique for refactoring triage rather than for finding exploits. For example, a security-focused taint analyzer would limit us to studying flow from "user supplied input", since it looks for things like SQL injection and XSS attacks. We couldn't ask questions like "when a new configuration option is introduced, how will it need to flow to get to where it is used?"
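To make the distinction above concrete, here is a minimal Java sketch. Everything in it (`TaintExample`, `FeeSetting`, `accepts`, the fee values) is hypothetical and invented for illustration; none of it is taken from Besu. It assumes a configuration value that is wrapped in an immutable object and then transformed before it is used.

```java
// Hypothetical illustration; none of these types are taken from Besu.
final class TaintExample {

  // An immutable value holder: every "change" produces a new object.
  static final class FeeSetting {
    private final long minFee;

    FeeSetting(long minFee) {
      this.minFee = minFee;
    }

    long minFee() {
      return minFee;
    }

    // Returns a NEW object derived from this one; the original is never mutated.
    FeeSetting doubled() {
      return new FeeSetting(minFee * 2);
    }
  }

  // The "sink": the place where the configured value is finally used.
  static boolean accepts(FeeSetting setting, long offeredFee) {
    return offeredFee >= setting.minFee();
  }

  public static void main(String[] args) {
    long cliValue = 10L;                          // the "source": a newly introduced option
    FeeSetting parsed = new FeeSetting(cliValue); // a new object now wraps the value
    FeeSetting effective = parsed.doubled();      // a second object holding a derived value
    // An analysis that only follows the exact value cliValue can lose track above.
    // Taint tracking also follows derived values, so it can link cliValue to this call.
    System.out.println(accepts(effective, 25L));
  }
}
```

The question "how does a new configuration option get to where it is used" corresponds to asking whether `cliValue` influences the argument passed to `accepts`, even though the original value never reaches that call unchanged.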
CodeQL is the current favorite, since it has explicit support for taint tracking via its DataFlow APIs. In general use, devs write queries in CodeQL, which run against a generated code database. GitHub already generates these databases, and they can be accessed directly from Visual Studio Code. VS Code is the supported tool for developing in CodeQL, so devs will have to become comfortable with it. Custom CodeQL queries can be checked in to the Besu repo and treated as "problem scans" if desired, and GitHub Actions could annotate pull requests with the problems found.
CodeQL can be thought of as two different APIs: an Abstract Syntax Tree (AST) API and a DataFlow API. The AST API is language specific and understands the building blocks of static source code. It is easy to understand, and the tooling comes with a viewer that lets you explore existing code to see how it fits into CodeQL's Java-specific model.
The DataFlow API is very different, and often incongruous with the AST API. It is used to chart the paths that data takes as it flows through the code, and it has support for taint tracking. This support is limited by how broad your CodeQL database is: if you have not instrumented the libraries you depend on, there is no way to track taint through that code. This is a major problem for Besu in particular, which often extends classes provided by Apache Tuweni or uses other cryptography libraries. Developers need a way to say "trust me, that is tainted" and move on with their analysis. I have been able to leverage this in some cases, but not all: classes which extend Tuweni types, like Wei, were particularly hard to track taint through. I suspect this is just a misunderstanding on my part of how the disparate AST and DataFlow models interoperate, and I have reached out to other CodeQL users for help.
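The "trust me, that is tainted" idea can be sketched as a toy model in Java. This is not the CodeQL API and not how CodeQL is implemented; the class `ManualTaintSteps`, its methods, and the node names are all hypothetical. It only assumes that taint analysis can be viewed as reachability over a graph of flow steps, where a developer may add steps by hand across code the analyzer cannot see.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: this is NOT the CodeQL API, just the idea of declaring extra taint steps.
final class ManualTaintSteps {

  // Directed edges meaning "taint moves from the key to each value".
  private final Map<String, Set<String>> steps = new HashMap<>();

  // Records one step, whether found by analysis or asserted by a developer.
  void addStep(String from, String to) {
    steps.computeIfAbsent(from, k -> new HashSet<>()).add(to);
  }

  // Breadth-first search: is 'sink' reachable from 'source' over the recorded steps?
  boolean reaches(String source, String sink) {
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    seen.add(source);
    queue.add(source);
    while (!queue.isEmpty()) {
      String node = queue.remove();
      if (node.equals(sink)) {
        return true;
      }
      for (String next : steps.getOrDefault(node, Collections.<String>emptySet())) {
        if (seen.add(next)) {
          queue.add(next);
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ManualTaintSteps graph = new ManualTaintSteps();
    // Steps an analyzer could see because that source code is in its database.
    graph.addStep("cliOption", "libraryCall(arg)");
    graph.addStep("libraryCall(result)", "feeCheck");
    // The library body is opaque, so the path is broken at this point.
    System.out.println(graph.reaches("cliOption", "feeCheck"));
    // "Trust me, that is tainted": a hand-written step across the library call.
    graph.addStep("libraryCall(arg)", "libraryCall(result)");
    System.out.println(graph.reaches("cliOption", "feeCheck"));
  }
}
```

In this model, the missing library is just a missing edge, and one manual edge restores the path. CodeQL's taint-tracking configurations offer a hook for declaring additional steps that plays a similar role; the exact predicate name depends on the library version, so check the CodeQL documentation before relying on it.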
At the suggestion of EF security staff, I'm also adding Semgrep to the list of potential tools that can help figure this out. Its documentation has a whole section on taint analysis, but I have not tried it yet. The description at https://semgrep.dev/docs/writing-rules/data-flow/taint-mode/ sounds promising.