Skip to content

Conversation

@satwiksps
Copy link

[ENH] Replace os.path with Path for dataset directory handling (#3088)

What does this implement/fix?

This PR modernizes the dataset path-handling logic across the datasets module by replacing legacy os.path usage with pathlib.Path.

Updated files:

  • aeon/datasets/_data_loaders.py
  • aeon/datasets/_single_problem_loaders.py
  • aeon/datasets/dataset_collections.py

Summary of changes

  • Added from pathlib import Path to each updated file.
  • Replaced:
    • os.path.dirname(aeon.__file__)
    • os.path.join(...)
  • With:
    • Path(aeon.__file__).resolve().parent / "datasets"

How this solves issue #3088

The previous implementation relied on os.path.dirname(aeon.__file__), which may resolve to unexpected paths under editable installs, virtual environments, or certain packaging scenarios.

By using Path(aeon.__file__).resolve().parent, the resolved path always points to the actual installed module directory.
This stabilizes dataset directory discovery and prevents issues where datasets are loaded from or written to incorrect locations.


Testing

  • All dataset-loading tests passed (pytest).
  • Verified dataset loading, regression/classification loading, and forecasting loading manually.
  • Pre-commit hooks were run; formatting and imports were corrected automatically.

Does your contribution introduce a new dependency?

No, pathlib is part of the Python standard library.


Any other comments?

This improves cross-platform reliability and aligns the codebase with modern Python best practices.


PR checklist

  • I will use @all-contributors after merge.
  • PR title begins with [ENH].

@aeon-actions-bot aeon-actions-bot bot added datasets Datasets and data loaders enhancement New feature, improvement request or other non-bug code enhancement labels Nov 14, 2025
@aeon-actions-bot
Copy link
Contributor

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ enhancement ].
I have added the following labels to this PR based on the changes made: [ datasets ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Regenerate expected results for testing
  • Push an empty commit to re-run CI checks

@satwiksps
Copy link
Author

CI Failures Summary

Hi! @MatthewMiddlehurst @TonyBagnall

The CI reports several failing jobs, but the errors seem unrelated to the changes in this PR.

Details

  • The failures occur in tests such as test_kshapes and a small numerical mismatch in a ROCKAD doctest.
  • These belong to clustering and anomaly-detection modules, which this PR does not touch.
  • All dataset-related tests pass successfully.
  • No path-handling issues or import errors appeared, and local tests for the modified files succeeded.

Context

This PR only updates dataset directory handling to use pathlib.Path, so it should not affect the modules where the failures occurred.
The failing tests look like numerical-sensitivity issues on newer Python/NumPy versions rather than something caused by this change.

Request

Could a maintainer please confirm whether these CI test failures are expected on the current Python versions?
I’m happy to adjust anything if required, but the issues seem independent of this PR.

Thank you! 🙏

@Nithurshen
Copy link

Nithurshen commented Nov 16, 2025

The CI failures is common to all the PRs. It would be fixed.

@satwiksps satwiksps closed this Nov 17, 2025
@satwiksps satwiksps deleted the fix-dataset-root-3088 branch November 17, 2025 11:19
@MatthewMiddlehurst
Copy link
Member

Was there an issue with this PR?

@satwiksps
Copy link
Author

Was there an issue with this PR?

Hi, thanks for following up @MatthewMiddlehurst

I closed the previous PR because, after reviewing it again, I realised that the change I made (replacing os.path with Path) did not actually fix the underlying issue raised in #3088
It only modernised the path code but did not change the dataset storage location.

The real problem is that downloaded TSC datasets are currently saved inside the package directory (site-packages/aeon/datasets/data/), which is not ideal because package directories should be treated as read-only and can cause permission issues across environments.

For the proper fix, I’m planning to move the default dataset storage location to a user-writable directory using platformdirs.
On Windows this resolves to something like - C:\Users\satwiksps\AppData\Local\aeon\aeon-toolkit\datasets\ which avoids polluting the installation directory.

Before I start the new PR, could you please confirm that user_data_dir("aeon", "aeon-toolkit") / "datasets" is the preferred default location for TSC dataset downloads ?

I’ll proceed as soon as I have your confirmation.

@MatthewMiddlehurst
Copy link
Member

It would be better to edit your current PR rather than create a new one for changes like this. Rather than AppData it should probably be wherever the code is run from right? i.e. if you run code/my_code.py it should create code/aeon_data/my_data.ts or something. Up for discussion.

If you don't mind answering, why did you initially only change the method to find the path? Just curious whether it was an problem with the issue.

@satwiksps
Copy link
Author

It would be better to edit your current PR rather than create a new one for changes like this. Rather than AppData it should probably be wherever the code is run from right? i.e. if you run code/my_code.py it should create code/aeon_data/my_data.ts or something. Up for discussion.

I’m happy to continue with the existing PR instead of opening a new one. I closed it only because I thought the fix was incomplete and possibly misleading.

About the default location:
I’m open to storing the data relative to the working directory (e.g. ./aeon_data/...) as you suggested. My initial thought was to use a user-writable platform directory (e.g. AppData/Local/aeon/... on Windows, ~/.local/share/aeon/... on Linux) since many ML libraries (e.g., HuggingFace, torchtext) follow that pattern, but using a project-local folder like ./aeon_data also has advantages for reproducibility and portability.

I’m happy to implement this and I’ll update the PR accordingly.

@satwiksps
Copy link
Author

If you don't mind answering, why did you initially only change the method to find the path? Just curious whether it was an problem with the issue.

Initially, I interpreted the issue as primarily a problem with how the dataset directory was being resolved, not where it should live. The use of os.path.dirname(aeon.__file__) sometimes resolves to unexpected paths under editable instals, venvs, or when the package is symlinked.

So my first instinct was to fix the path resolution method to reduce those inconsistencies. Only after digging deeper into the datasets module did I realize the real problem is not about resolution, but about location.

So my first PR only improved the mechanics of path construction, not the logic of choosing the correct base directory. Once I connected that the issue is actually about selecting a new default storage root, I realized the initial fix wasn’t addressing the underlying problem so I stepped back to rethink the design.

@satwiksps satwiksps restored the fix-dataset-root-3088 branch November 17, 2025 12:43
@satwiksps satwiksps reopened this Nov 17, 2025
@MatthewMiddlehurst
Copy link
Member

Interesting to hear that other packages put it in AppData. May be worth asking on Slack if there is a preference? I think for now if you want to edit the PR just do it locally as its an improvement over the current method, and we can change it later if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasets Datasets and data loaders enhancement New feature, improvement request or other non-bug code enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants