Skip to content

Python: Extract files in hidden dirs by default #19424

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

tausbn
Copy link
Contributor

@tausbn tausbn commented Apr 30, 2025

Changes the default behaviour of the Python extractor so that files inside hidden directories are extracted by default.

Also adds an extractor option, skip_hidden_directories, which can be set to true in order to revert to the old behaviour.

Finally, I made the logic surrounding what is logged in various cases a bit more obvious.

Technically this changes the behaviour of the extractor (in that hidden excluded files will now be logged as (excluded), but I think this makes more sense anyway.

@tausbn tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from eef5a49 to 4d00556 Compare April 30, 2025 12:40
Changes the default behaviour of the Python extractor so files inside
hidden directories are extracted by default.

Also adds an extractor option, `skip_hidden_directories`, which can be
set to `true` in order to revert to the old behaviour.

Finally, I made the logic surrounding what is logged in various cases a
bit more obvious.

Technically this changes the behaviour of the extractor (in that hidden
excluded files will now be logged as `(excluded)`, but I think this
makes more sense anyway.
@tausbn tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from 4d00556 to 5cf3242 Compare May 2, 2025 12:45
@tausbn tausbn changed the title Python: Extract hidden files/dirs by default Python: Extract files in hidden dirs by default May 2, 2025
@tausbn tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from c3ac796 to be88e61 Compare May 2, 2025 14:01
@tausbn tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from be88e61 to 2ded42c Compare May 2, 2025 14:27
@tausbn tausbn marked this pull request as ready for review May 2, 2025 14:28
@Copilot Copilot AI review requested due to automatic review settings May 2, 2025 14:28
@tausbn tausbn requested a review from a team as a code owner May 2, 2025 14:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the Python extractor to include files in hidden directories by default, adds an option to revert to the old behavior, and makes the logging logic more explicit.

  • Default extraction now includes files in hidden directories.
  • New skip_hidden_directories extractor option to restore previous behavior.
  • Refactored logging branches and updated tests, change notes, and schema.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
python/ql/test/extractor-tests/filter-option/Test.expected Added expected line for hidden_foo.py script.
python/ql/test/2/extractor-tests/hidden/test.expected Updated output to include files under hidden directories.
python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md Documented new default behavior and skip_hidden_directories option.
python/extractor/semmle/traverser.py Reordered exclusion checks, added env‑var‑driven is_hidden override.
python/extractor/cli-integration-test/hidden-files/test.sh Added CLI integration tests for default and skipped hidden behavior.
python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py Base script for the CLI test.
python/extractor/cli-integration-test/hidden-files/query.ql Query to list extracted file names.
python/extractor/cli-integration-test/hidden-files/query-default.expected Expected output when hidden files are included.
python/extractor/cli-integration-test/hidden-files/query-skipped.expected Expected output when skip_hidden_directories=true.
python/codeql-extractor.yml Added skip_hidden_directories option to extractor schema.
Comments suppressed due to low confidence (3)

python/codeql-extractor.yml:47

  • [nitpick] Consider defining skip_hidden_directories as a boolean type instead of a string with a pattern, to align with typical boolean option conventions.
skip_hidden_directories:

python/extractor/semmle/traverser.py:103

  • There is no test for the Windows-specific is_hidden branch when skip_hidden_directories=true on os.name == 'nt'. Consider adding a Windows-mode test to cover that path.
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_SKIP_HIDDEN_DIRECTORIES", "false") == "false":

python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py:1

  • The CLI integration test expectations reference .hidden_file.py and visible_file_in_hidden_dir.py, but those files are not present in repo_dir. Add these files to match query-default.expected and query-skipped.expected.
print(42)

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing. I would have preferred a more direct structure; something like

  • is_hidden stays as is
  • define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files
  • use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.

@tausbn
Copy link
Contributor Author

tausbn commented May 2, 2025

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing.

That's a fair point. In fact, originally my intention was to have the "do we skip hidden dirs or not" logic be located inside the _treewalk function (hence why I decided to refactor this part of the code to only make one call to is_hidden). But then I realised that is_hidden was already a function that we were defining ourselves, and so it seemed less intrusive to just add a third definition for that function.

Really, what I think we should do is get rid of is_hidden entirely, and I'm beginning to wonder if that's not an even better solution overall. If you want to exclude hidden directories, then you should exclude them using the existing mechanisms. I need to double-check the behaviour, but I would hope that excluding **/.*/** or something along those lines would do the trick. (Although the double ** is a bit weird, and for Windows it won't match the current behaviour.)

To me, at least, it seems like a much better default behaviour to say "we extract all Python files" rather than have weird heuristics for which ones to extract or not.

I would have preferred a more direct structure; something like

  • is_hidden stays as is
  • define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files
  • use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.

I feel abstracting the exclusion logic into an is_excluded function would make the code less readable, and I feel like having this function return a reason would make the ergonomics a bit awkward (unless we decide to use the empty string as False, which would work but feels a bit icky).

Alternatively, how would you feel if I renamed is_hidden to is_skipped? Currently we only ever skip hidden directories, so this would have the same result. The reason logged (with debugging enabled) would need to change to (skipped) for consistency, but I don't think this would cause any issues.


I'm going to test if excluding hidden directories can be done using the existing mechanism for file filtering. If it does, then I'll just get rid of is_hidden entirely and rewrite the change note to highlight this new approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants