-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: Instantiate model generation library #19295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR instantiates the model generation library for C++ by adding the complete library, tests, and associated extension configuration files to enable model generation via query summaries.
- Adds annotated model definitions and summaries in the library tests.
- Introduces tests (including instantiation of the inline expectation test framework) for verifying model generation.
- Provides extension YAML files to integrate summary models into the CodeQL pack.
Reviewed Changes
Copilot reviewed 4 out of 18 changed files in this pull request and generated no comments.
File | Description |
---|---|
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp | Adds model definitions with summary annotations for C++ dataflow. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ext.yml | Configures summary capture extension with manual models. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml | Configures content-based summary capture extension. |
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py | Introduces a utility script for generating C++ flow models. |
Files not reviewed (14)
- cpp/ql/lib/utils/test/InlineMadTest.qll: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureMixedNeutralModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureMixedSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureNeutralModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSinkModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSourceModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll: Language not supported
- cpp/ql/src/utils/modelgenerator/internal/CaptureModelsPrinting.qll: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.expected: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.expected: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ql: Language not supported
Comments suppressed due to low confidence (2)
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py:9
- [nitpick] Consider renaming 'madpath' to a more descriptive name like 'modelsDataPath' for improved clarity.
madpath = os.path.join(gitroot, "misc/scripts/models-as-data/")
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml:6
- [nitpick] Ensure consistent naming for model identifiers; consider using 'Models' instead of 'models' to align with other summary annotations.
- [ "models", "ManuallyModelled", False, "hasSummary", "(void *)", "", "Argument[0]", "ReturnValue", "value", "manual"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very very nice! Well done @MathiasVP !
import generate_flow_model as model | ||
|
||
language = "cpp" | ||
model.Generator.make(language).run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is my intention to change the python script such that --with-summaries
uses the mixed query instead (and correspondingly for the neutral), but for testing purposes it is nice to keep both the content based and heuristic based queries around.
f.isStatic() | ||
} | ||
|
||
predicate isUninterestingForDataFlowModels(Callable api) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider to move the content of this predicate to the relevant
predicate instead.
This will make things easier in case you intend to introduce "lifting" logic for the produced models.
For C# and Java we consider implementations of method "prototypes" (implementations of interface- or abstract class members) to abide to the "contract" of the interface- or abstract member - at least if the implementation is in the same codebase as the interface- or abstract member declaration.
That is, for something like
public interface I {
object M(object o);
}
public class C : I {
public object M(object o) {
return o;
}
}
we would like to "lift" the model identified for C.M
to I.M
and use this for all implementations of I.M
.
* of `c`. | ||
*/ | ||
private string isExtensible(Callable c) { | ||
if c instanceof MemberFunction then result = "true" else result = "false" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe return false
for member functions that are final or declared in a final class/struct.
i | ||
) | ||
| | ||
if params = "" then result = "()" else result = "(" + params + ")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if params = "" then result = "()" else result = "(" + params + ")" | |
"(" + params + ")" |
// uninteresting (which is good!) | ||
not api.(Function).hasDefinition() | ||
or | ||
isPrivate(api) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about protected members?
import generate_flow_model as model | ||
|
||
language = "cpp" | ||
model.Generator.make(language).run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python script puts the generated models in
lib/ext/generated
If you intend to use this location for generated models, maybe consider to extend the qlpack.yml
to also include data extensions from this location.
int* tainted; | ||
|
||
//No model as destructors are excluded from model generation. | ||
~BasicFlow() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider adding more testcases for members where we expect not to get any models. It is a good idea (as you have already done) to narrow the number of produced models to only "effectively public" endpoints as loading data extensions can be somewhat expensive (if there are 10k's+ models).
Inspiration for adding model generation summaries to DCA: https://github.com/github/codeql-dca/pull/847/ |
Now that both #19273 and #19274 are merged we can finally get to the fun part: Adding the actual implementation of model generation to C++ 🎉.
I couldn't think of a good way to structure the commits in this PR. Apologies!
I think that after this PR is merged the final step is to add the required DCA suite for model generation. However, as this isn't yet done I don't think it makes sense to run any DCA for this right now.
I've done some testing of this already: I've run the model generation on sqlite and the models look sensible for the very small subset that I've checked (if you're curious: https://gist.github.com/MathiasVP/6942f022c7a8f4e515c80ccd442ab59f).
cc @michaelnebel would you mind taking a brief look at this PR? I don't expect you to review the C++ specific parts, obviously 🙈