Save files in a directory tree of bounded degree #171

Merged
kralka merged 5 commits into google:main from kralka:tree_directory_structure
May 23, 2025
Collaborator

@kralka kralka commented May 22, 2025

Limit the number of files in a directory and the number of subdirectories of a directory during random file path generation.
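
As an illustration of the idea only, here is a minimal sketch of a path generator that keeps the number of files per directory and the number of subdirectories per directory bounded. The class name `BoundedTreePathGenerator` and its internals are hypothetical and are not taken from sedpack; only the parameter names `levels`, `max_branching`, `name_length` and the method name `get_path` mirror the test snippets quoted in this thread. The sketch assumes the top level is allowed to grow without bound once the tree is full, and that the space of random names is much larger than the number of names needed.

```python
import secrets
from pathlib import Path


class BoundedTreePathGenerator:
    """Hypothetical sketch: random relative paths in a tree of bounded degree."""

    def __init__(self, levels: int, max_branching: int, name_length: int) -> None:
        if levels < 1 or max_branching < 1 or name_length < 1:
            raise ValueError("levels, max_branching and name_length must be positive")
        self._levels = levels
        self._max_branching = max_branching
        self._name_length = name_length
        self._generated = 0  # Number of paths handed out so far.
        # Random name of each directory, keyed by (depth, directory id).
        self._dir_names: dict[tuple[int, int], str] = {}
        self._used_names: set[str] = set()

    def _fresh_name(self, length: int) -> str:
        """Return a random hex name this generator has not used before."""
        while True:
            name = secrets.token_hex(length)[:length]
            if name not in self._used_names:
                self._used_names.add(name)
                return name

    def get_path(self) -> Path:
        """Return a new path with exactly `levels` components."""
        # The deepest directory with id d holds files d*b .. d*b + b - 1.
        node = self._generated // self._max_branching
        self._generated += 1
        dirs: list[str] = []
        # Walk from the deepest directory level up to the top level.
        for depth in range(self._levels - 2, -1, -1):
            key = (depth, node)
            if key not in self._dir_names:
                self._dir_names[key] = self._fresh_name(self._name_length)
            dirs.append(self._dir_names[key])
            # The parent of directory n has id n // b, so it has at most
            # b subdirectories. Only the top level is unbounded.
            node //= self._max_branching
        # File names are longer than directory names (name/name/long_name).
        file_name = self._fresh_name(2 * self._name_length)
        return Path(*reversed(dirs), file_name)
```

Filling directories in order keeps the bookkeeping to a single counter. A real implementation may distribute files differently.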

@coveralls
Pull Request Test Coverage Report for Build 15191406455

Details

  • 74 of 74 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 87.146%

Totals Coverage Status
Change from base Build 15104760308: 0.4%
Covered Lines: 2434
Relevant Lines: 2793

💛 - Coveralls

@kralka kralka requested a review from jmichelp May 22, 2025 17:26
Comment thread tests/io/test_file_info.py Outdated
# No exception in the top level
for _ in range((max_branching**levels) + 10):
    n = generator.get_path()
    assert n.count("/") == levels - 1
Collaborator

Does the test pass on Windows, or do you need to change that to use os.sep?

Collaborator Author

[obsolete, generating Path]

Yes, the tests passed (https://github.com/google/sedpack/blob/main/.github/workflows/pytest.yml#L20C18-L20C24). Converting a string to a pathlib Path should work with forward slashes on all platforms, right?
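
The portability question can be checked without a Windows machine, because pathlib ships pure path classes for both flavours. The snippet below is a small sketch; the example path string is made up. It shows that forward slashes are parsed the same way by both flavours, while the string rendering differs, which is why counting "/" in a string is less portable than inspecting `parts`.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# Made-up relative path in the name/name/name/long_name format.
relative = "ab/cd/ef/long_file_name"

posix_path = PurePosixPath(relative)
windows_path = PureWindowsPath(relative)

# Both flavours split a forward-slash string into the same components.
assert posix_path.parts == ("ab", "cd", "ef", "long_file_name")
assert windows_path.parts == ("ab", "cd", "ef", "long_file_name")

# Converting back to a string uses the flavour's own separator, so
# counting "/" in str(path) would fail for the Windows flavour.
assert str(windows_path) == "ab\\cd\\ef\\long_file_name"
assert str(windows_path).count("/") == 0

# as_posix() always renders forward slashes.
assert windows_path.as_posix() == relative

# os.sep separates path components; os.pathsep separates the entries
# of a search path such as the PATH environment variable.
assert os.sep in ("/", "\\")
assert os.pathsep in (":", ";")
```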

Comment thread src/sedpack/io/file_info.py Outdated
Comment thread tests/io/test_file_info.py Outdated
assert all(len(part) == name_length for part in p.parts[:-1])

# Enforce format: name/name/name/long_name
for l in range(1, levels):
Collaborator

Isn't that a duplicate from the assert above?

Collaborator Author

[obsolete, generating Path]

This was useful later for checking the number of subdirectories.
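
One way such a check on the number of subdirectories could look is sketched below. The helper `children_per_directory` is hypothetical and not part of the sedpack test suite. It takes a set of generated paths and records, for every directory that appears as a prefix, the distinct names directly below it.

```python
from collections import defaultdict
from pathlib import Path


def children_per_directory(paths: set[Path]) -> dict[Path, set[str]]:
    """Map each directory to the distinct entry names directly below it."""
    children: dict[Path, set[str]] = defaultdict(set)
    for path in paths:
        current = path
        # Walk upwards until the parent stops changing ("." or a root).
        while current != current.parent:
            children[current.parent].add(current.name)
            current = current.parent
    return dict(children)


# Made-up paths in the name/name/long_name format.
seen_paths: set[Path] = {
    Path("aa/bb/file0001"),
    Path("aa/bb/file0002"),
    Path("aa/cc/file0003"),
}
children = children_per_directory(seen_paths)

assert children[Path("aa")] == {"bb", "cc"}
assert children[Path("aa/bb")] == {"file0001", "file0002"}
```

A test can then assert that `len(names) <= max_branching` for every directory it wants to bound, excluding the top level if that level is allowed to grow.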

@kralka kralka requested a review from jmichelp May 23, 2025 09:06
jmichelp previously approved these changes May 23, 2025
Comment thread tests/io/test_file_info.py Outdated
name_length=name_length,
)

seen_paths: set[str] = set()
Collaborator

set[Path] actually :)

Collaborator Author

yep, thank you

@kralka kralka enabled auto-merge May 23, 2025 09:50
@kralka kralka added this pull request to the merge queue May 23, 2025
Merged via the queue into google:main with commit 7b1e661 May 23, 2025
58 checks passed
@kralka kralka deleted the tree_directory_structure branch May 23, 2025 18:23
wsxrdv added a commit to wsxrdv/sedpack that referenced this pull request Aug 25, 2025
Pull request google#171 introduced a
problem in the benchmarking code that rendered it useless. This commit
traverses the dataset directory recursively and thus restores
meaningful benchmarking. It also makes sure that we traverse the
expected number of shards.

This commit also includes the test and holdout splits, since we might
as well use that data when we have it. This will introduce a regression
compared to the first benchmarks and definitely one compared to the
empty benchmarks.
@wsxrdv wsxrdv mentioned this pull request Aug 25, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Aug 25, 2025
Pull request #171 introduced a
problem in the benchmarking code that rendered it useless. This commit
traverses the dataset directory recursively and thus restores
meaningful benchmarking. It also makes sure that we traverse the
expected number of shards.

This commit also adds the test and holdout splits to benchmarking, since
we might as well use that data when we have it. This will introduce a
regression compared to the first benchmarks and definitely one compared
to the empty benchmarks.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
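
To illustrate why a non-recursive listing stops finding shards once they live in nested subdirectories, here is a small self-contained sketch. It is not the actual sedpack benchmarking code: the helper `find_shards`, the `.npz` suffix, and the directory names are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def find_shards(dataset_root: Path, suffix: str = ".npz") -> list[Path]:
    """Recursively collect shard files below dataset_root.

    The suffix is an assumption made for this sketch.
    """
    return sorted(p for p in dataset_root.rglob(f"*{suffix}") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    expected = 0
    # Build a made-up dataset with shards two directory levels deep.
    for split in ("train", "test", "holdout"):
        for sub in ("ab/cd", "ab/ef"):
            shard_dir = root / split / sub
            shard_dir.mkdir(parents=True)
            (shard_dir / "shard0000.npz").touch()
            expected += 1

    # A flat listing of the split directory sees no shards at all.
    flat = list((root / "train").glob("*.npz"))
    # The recursive traversal finds every shard in every split.
    found = find_shards(root)

assert flat == []
assert len(found) == expected
```

Comparing the number of shards found against the expected count, as the commit message describes, guards against silently benchmarking an empty iteration.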
Challenge-Cyber pushed a commit to Challenge-Cyber/zhu that referenced this pull request Sep 29, 2025