Python: Extract files in hidden dirs by default #19424

tausbn · 2025-04-30T12:38:16Z

Changes the default behaviour of the Python extractor so that files inside hidden directories are extracted by default.

Also adds an extractor option, skip_hidden_directories, which can be set to true in order to revert to the old behaviour.

Finally, I made the logic surrounding what is logged in various cases a bit more obvious.

Technically this changes the behaviour of the extractor (in that hidden excluded files will now be logged as (excluded), but I think this makes more sense anyway.

Changes the default behaviour of the Python extractor so files inside hidden directories are extracted by default. Also adds an extractor option, `skip_hidden_directories`, which can be set to `true` in order to revert to the old behaviour. Finally, I made the logic surrounding what is logged in various cases a bit more obvious. Technically this changes the behaviour of the extractor (in that hidden excluded files will now be logged as `(excluded)`, but I think this makes more sense anyway.

Copilot

Pull Request Overview

This PR changes the Python extractor to include files in hidden directories by default, adds an option to revert to the old behavior, and makes the logging logic more explicit.

Default extraction now includes files in hidden directories.
New skip_hidden_directories extractor option to restore previous behavior.
Refactored logging branches and updated tests, change notes, and schema.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
python/ql/test/extractor-tests/filter-option/Test.expected	Added expected line for hidden_foo.py script.
python/ql/test/2/extractor-tests/hidden/test.expected	Updated output to include files under hidden directories.
python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md	Documented new default behavior and `skip_hidden_directories` option.
python/extractor/semmle/traverser.py	Reordered exclusion checks, added env‑var‑driven `is_hidden` override.
python/extractor/cli-integration-test/hidden-files/test.sh	Added CLI integration tests for default and skipped hidden behavior.
python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py	Base script for the CLI test.
python/extractor/cli-integration-test/hidden-files/query.ql	Query to list extracted file names.
python/extractor/cli-integration-test/hidden-files/query-default.expected	Expected output when hidden files are included.
python/extractor/cli-integration-test/hidden-files/query-skipped.expected	Expected output when `skip_hidden_directories=true`.
python/codeql-extractor.yml	Added `skip_hidden_directories` option to extractor schema.

Comments suppressed due to low confidence (3)

python/codeql-extractor.yml:47

[nitpick] Consider defining skip_hidden_directories as a boolean type instead of a string with a pattern, to align with typical boolean option conventions.

skip_hidden_directories:

python/extractor/semmle/traverser.py:103

There is no test for the Windows-specific is_hidden branch when skip_hidden_directories=true on os.name == 'nt'. Consider adding a Windows-mode test to cover that path.

if os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_SKIP_HIDDEN_DIRECTORIES", "false") == "false":

python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py:1

The CLI integration test expectations reference .hidden_file.py and visible_file_in_hidden_dir.py, but those files are not present in repo_dir. Add these files to match query-default.expected and query-skipped.expected.

print(42)

yoff

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing. I would have preferred a more direct structure; something like

is_hidden stays as is
define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files
use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.

tausbn · 2025-05-02T21:02:29Z

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing.

That's a fair point. In fact, originally my intention was to have the "do we skip hidden dirs or not" logic be located inside the _treewalk function (hence why I decided to refactor this part of the code to only make one call to is_hidden). But then I realised that is_hidden was already a function that we were defining ourselves, and so it seemed less intrusive to just add a third definition for that function.

Really, what I think we should do is get rid of is_hidden entirely, and I'm beginning to wonder if that's not an even better solution overall. If you want to exclude hidden directories, then you should exclude them using the existing mechanisms. I need to double-check the behaviour, but I would hope that excluding **/.*/** or something along those lines would do the trick. (Although the double ** is a bit weird, and for Windows it won't match the current behaviour.)

To me, at least, it seems like a much better default behaviour to say "we extract all Python files" rather than have weird heuristics for which ones to extract or not.

I would have preferred a more direct structure; something like

is_hidden stays as is

define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files

use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.

I feel abstracting the exclusion logic into an is_excluded function would make the code less readable, and I feel like having this function return a reason would make the ergonomics a bit awkward (unless we decide to use the empty string as False, which would work but feels a bit icky).

Alternatively, how would you feel if I renamed is_hidden to is_skipped? Currently we only ever skip hidden directories, so this would have the same result. The reason logged (with debugging enabled) would need to change to (skipped) for consistency, but I don't think this would cause any issues.

I'm going to test if excluding hidden directories can be done using the existing mechanism for file filtering. If it does, then I'll just get rid of is_hidden entirely and rewrite the change note to highlight this new approach.

github-actions bot added Python documentation labels Apr 30, 2025

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from eef5a49 to 4d00556 Compare April 30, 2025 12:40

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from 4d00556 to 5cf3242 Compare May 2, 2025 12:45

tausbn changed the title ~~Python: Extract hidden files/dirs by default~~ Python: Extract files in hidden dirs by default May 2, 2025

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from c3ac796 to be88e61 Compare May 2, 2025 14:01

tausbn added 3 commits May 2, 2025 14:27

Python: Add integration test

605f2bf

Python: Add change note

67d04d5

Python: Update extractor tests

2ded42c

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from be88e61 to 2ded42c Compare May 2, 2025 14:27

tausbn marked this pull request as ready for review May 2, 2025 14:28

Copilot AI review requested due to automatic review settings May 2, 2025 14:28

tausbn requested a review from a team as a code owner May 2, 2025 14:28

Copilot AI reviewed May 2, 2025

View reviewed changes

yoff approved these changes May 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python: Extract files in hidden dirs by default #19424

Python: Extract files in hidden dirs by default #19424

tausbn commented Apr 30, 2025 •

edited

Loading

Copilot AI left a comment

yoff left a comment

tausbn commented May 2, 2025

Python: Extract files in hidden dirs by default #19424

Are you sure you want to change the base?

Python: Extract files in hidden dirs by default #19424

Conversation

tausbn commented Apr 30, 2025 • edited Loading

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

yoff left a comment

Choose a reason for hiding this comment

tausbn commented May 2, 2025

tausbn commented Apr 30, 2025 •

edited

Loading