Skip to content

C++: Instantiate model generation library #19295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MathiasVP
Copy link
Contributor

Now that both #19273 and #19274 are merged we can finally get to the fun part: Adding the actual implementation of model generation to C++ 🎉.

I couldn't think of a good way to structure the commits in this PR. Apologies!

  • The first commit adds the entire library
  • The second commit adds tests that will succeed at the end of the PR.
  • The third commit instantiates the inline expectation test framework to test model generation
  • The fourth commit adds all the files that I could copy/paste from C#/Java/Rust. It's basically all the query and test files required for model generation.

I think that after this PR is merged the final step is to add the required DCA suite for model generation. However, as this isn't yet done I don't think it makes sense to run any DCA for this right now.

I've done some testing of this already: I've run the model generation on sqlite and the models look sensible for the very small subset that I've checked (if you're curious: https://gist.github.com/MathiasVP/6942f022c7a8f4e515c80ccd442ab59f).

cc @michaelnebel would you mind taking a brief look at this PR? I don't expect you to review the C++ specific parts, obviously 🙈

@Copilot Copilot bot review requested due to automatic review settings April 11, 2025 18:53
@MathiasVP MathiasVP requested a review from a team as a code owner April 11, 2025 18:53
@github-actions github-actions bot added the C++ label Apr 11, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR instantiates the model generation library for C++ by adding the complete library, tests, and associated extension configuration files to enable model generation via query summaries.

  • Adds annotated model definitions and summaries in the library tests.
  • Introduces tests (including instantiation of the inline expectation test framework) for verifying model generation.
  • Provides extension YAML files to integrate summary models into the CodeQL pack.

Reviewed Changes

Copilot reviewed 4 out of 18 changed files in this pull request and generated no comments.

File Description
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp Adds model definitions with summary annotations for C++ dataflow.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ext.yml Configures summary capture extension with manual models.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml Configures content-based summary capture extension.
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py Introduces a utility script for generating C++ flow models.
Files not reviewed (14)
  • cpp/ql/lib/utils/test/InlineMadTest.qll: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureMixedNeutralModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureMixedSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureNeutralModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSinkModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSourceModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll: Language not supported
  • cpp/ql/src/utils/modelgenerator/internal/CaptureModelsPrinting.qll: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.expected: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.expected: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ql: Language not supported
Comments suppressed due to low confidence (2)

cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py:9

  • [nitpick] Consider renaming 'madpath' to a more descriptive name like 'modelsDataPath' for improved clarity.
madpath = os.path.join(gitroot, "misc/scripts/models-as-data/")

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml:6

  • [nitpick] Ensure consistent naming for model identifiers; consider using 'Models' instead of 'models' to align with other summary annotations.
- [ "models", "ManuallyModelled", False, "hasSummary", "(void *)", "", "Argument[0]", "ReturnValue", "value", "manual"]

@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Apr 11, 2025
Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very very nice! Well done @MathiasVP !

import generate_flow_model as model

language = "cpp"
model.Generator.make(language).run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my intention to change the python script such that --with-summaries uses the mixed query instead (and correspondingly for the neutral), but for testing purposes it is nice to keep both the content based and heuristic based queries around.

f.isStatic()
}

predicate isUninterestingForDataFlowModels(Callable api) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider to move the content of this predicate to the relevant predicate instead.
This will make things easier in case you intend to introduce "lifting" logic for the produced models.
For C# and Java we consider implementations of method "prototypes" (implementations of interface- or abstract class members) to abide to the "contract" of the interface- or abstract member - at least if the implementation is in the same codebase as the interface- or abstract member declaration.
That is, for something like

public interface I { 
  object M(object o);
}

public class C : I {
  public object M(object o) {
    return o;
  }
}

we would like to "lift" the model identified for C.M to I.M and use this for all implementations of I.M.

* of `c`.
*/
private string isExtensible(Callable c) {
if c instanceof MemberFunction then result = "true" else result = "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe return false for member functions that are final or declared in a final class/struct.

i
)
|
if params = "" then result = "()" else result = "(" + params + ")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if params = "" then result = "()" else result = "(" + params + ")"
"(" + params + ")"

// uninteresting (which is good!)
not api.(Function).hasDefinition()
or
isPrivate(api)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about protected members?

import generate_flow_model as model

language = "cpp"
model.Generator.make(language).run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python script puts the generated models in

lib/ext/generated

If you intend to use this location for generated models, maybe consider to extend the qlpack.yml to also include data extensions from this location.

int* tainted;

//No model as destructors are excluded from model generation.
~BasicFlow() = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider adding more testcases for members where we expect not to get any models. It is a good idea (as you have already done) to narrow the number of produced models to only "effectively public" endpoints as loading data extensions can be somewhat expensive (if there are 10k's+ models).

@michaelnebel
Copy link
Contributor

Inspiration for adding model generation summaries to DCA: https://github.com/github/codeql-dca/pull/847/
Not sure why the experiments failed last week. We might need help from the DX team on this (but maybe this is something for after Easter).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants