Add metadata only stage to repodata generation #284

Open
soapy1 wants to merge 23 commits into conda:main from soapy1:cache-keep-around

Conversation

@soapy1 soapy1 commented Apr 16, 2026

Description

This introduces a new upstream stage called md. The existing stages in conda-index are:

  • fs - means that the artifact is now available in the set of packages; by default it is assumed to be on the local filesystem
  • indexed - means that the entry already exists in the database (same filename, same timestamp, same hash), and its package metadata has been extracted to index_json etc.

This PR adds md, which is like fs in that it represents an artifact that is now available, but the artifact is assumed not to be available on the local filesystem. md is a sibling to the indexed stage. This is helpful for indexing artifacts that cannot be represented on the local filesystem, for example PyPI packages.

To include metadata-sourced packages in repodata, be sure to pass include_stages and package_extensions in ChannelIndex's cache_kwargs when instantiating it. For example:

channel_index = ChannelIndex(
    tmp_path,
    "haswheels",  # channel name if different than last segment of tmp_path
    repodata_v3=True,
    cache_kwargs={
        "package_extensions": CONDA_PACKAGE_EXTENSIONS + (".whl",),
        "include_stages": ["md"],
    },
)

And inject the metadata using the store_md_state function:

cache = channel_index.cache_for_subdir("noarch")
. . .
cache.store_md_state(listdir_like())

See tests/test_demonstrate_wheel.py for how this changes the API for injecting wheel data into repodata.

ref: conda/conda-pypi#276 (comment)

Other notable changes

  • fixed import errors raised by from conda_index.postgres.cache import PsqlCache
    • these caused test failures in tests/test_psql.py; it looks like those tests had been getting skipped
    • fixed those tests
  • update conda_index.conda_index.cache.BaseCondaIndexCache.database_path to only prepend the database prefix if the prefix is not already included
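
The database_path change in the last bullet can be pictured with a minimal sketch. This is not the code from the PR: the standalone function, its prefix argument, and the "noarch/" prefix format are assumptions made for illustration. The point is only that applying the function twice gives the same result as applying it once.

```python
def database_path(path: str, prefix: str) -> str:
    """Return path with prefix prepended, unless it already starts with it.

    Hypothetical stand-in for BaseCondaIndexCache.database_path; the real
    method and its prefix format may differ.
    """
    if path.startswith(prefix):
        return path
    return prefix + path


# Assumed prefix format, for illustration only.
once = database_path("example-1.0-py3-none-any.whl", "noarch/")
twice = database_path(once, "noarch/")
```

Because the call is idempotent under this assumption, callers can pass either a bare filename or an already-prefixed path.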

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@github-project-automation github-project-automation Bot moved this to 🆕 New in 🔎 Review Apr 16, 2026
@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Apr 16, 2026
@dholth dholth self-requested a review April 17, 2026 12:18
Comment thread conda_index/index/__init__.py Outdated
self.created_at = now_dt.strftime("%Y-%m-%dT%H:%M:%SZ")

- def cache_for_subdir(self, subdir):
+ def cache_for_subdir(self, subdir, stage: str | None = None):
Author

A different approach might make the cache layer know about what all the desired upstream stages are instead of getting a cache by subdir and stage.
By moving the responsibility of keeping track of the upstream stages to the cache layer, we can avoid the merging logic below. This could be replaced by accounting for the multiple stages in the query phase.

Contributor

@dholth dholth Apr 17, 2026

The database layer is the right place to merge stages. It could be accounted for in cache_kwargs when creating ChannelIndex, or in indexed_packages() when performing the query.

Author

Let's go for including this in the cache_kwargs. We'll define a new attribute include_stages which accepts a list of stages to consider in addition to upstream_stage.
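
A rough sketch of what "in addition to upstream_stage" could mean in practice. The helper name stages_to_query is hypothetical and not part of conda-index; only upstream_stage and include_stages come from the discussion above.

```python
from typing import Iterable, Optional


def stages_to_query(
    upstream_stage: str, include_stages: Optional[Iterable[str]] = None
) -> list:
    """Combine upstream_stage with any extra stages, keeping order, dropping repeats.

    Hypothetical helper; the cache layer in the PR may combine them differently.
    """
    stages = [upstream_stage]
    for stage in include_stages or []:
        if stage not in stages:
            stages.append(stage)
    return stages


default_stages = stages_to_query("fs")
with_md = stages_to_query("fs", ["md", "fs"])
```

Keeping the combined list in one place would let the query phase account for every stage without merging separate per-stage caches.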

@soapy1 soapy1 force-pushed the cache-keep-around branch from e095641 to cd05ea6 Compare April 17, 2026 21:18
Comment thread conda_index/index/cache.py Outdated
Comment thread conda_index/index/sqlitecache.py Outdated
WITH
fs AS
- ( SELECT path, mtime, size, sha256, md5 FROM stat WHERE stage = :upstream_stage ),
+ ( SELECT path, mtime, size, sha256, md5 FROM stat WHERE stage IN ({stages_placeholders}) ),
Contributor

We'll have to check what happens when path is in fs and md at the same time. We would get the same index_json twice, maybe, but it would also overwrite itself in the output dict.

Author

yes! Added a test in tests/test_index.py::test_index_noarch_with_wheels. This will put wheels and noarch conda packages in the same noarch subdir.
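
The overlap case raised in this thread can be reproduced with a small in-memory sketch. The table layout below is a simplification assumed for illustration (the real stat table may have more columns and constraints), and the way stages_placeholders is built is a guess at one workable approach, not the code in the PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat "
    "(stage TEXT, path TEXT, mtime INTEGER, size INTEGER, sha256 TEXT, md5 TEXT)"
)
conn.executemany(
    "INSERT INTO stat VALUES (?, ?, ?, ?, ?, ?)",
    [
        # The same path recorded under both fs and md.
        ("fs", "noarch/a-1.0-0.conda", 1, 10, "aa", None),
        ("md", "noarch/a-1.0-0.conda", 1, 10, "aa", None),
        ("md", "noarch/b-1.0-py3-none-any.whl", 1, 20, "bb", None),
    ],
)

stages = ["fs", "md"]
# One named placeholder per stage, e.g. ":stage_0, :stage_1".
stages_placeholders = ", ".join(f":stage_{i}" for i in range(len(stages)))
params = {f"stage_{i}": stage for i, stage in enumerate(stages)}

query = (
    "SELECT path, mtime, size, sha256, md5 FROM stat "
    f"WHERE stage IN ({stages_placeholders}) ORDER BY path"
)
rows = conn.execute(query, params).fetchall()

# Keying by path collapses the duplicate row, as described above.
by_path = {path: {"size": size, "sha256": sha256} for path, _, size, sha256, _ in rows}
```

In this sketch the query returns the shared path twice, and the dict keyed by path ends up with a single entry for it.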

Contributor

dholth commented Apr 23, 2026

Do you think md is a sibling to the indexed phase, not the upstream fs phase? We wouldn't join md with indexed to find a list of packages whose metadata needs to be added; instead, md implies that the metadata has already been loaded into the index_json table. Then we only consider md+indexed when pulling data out of the database.

@soapy1 soapy1 self-assigned this Apr 27, 2026
Author

soapy1 commented Apr 27, 2026

hmmm, I was thinking of md and fs as siblings. But I think you are right, it should be md and indexed. Will re-jig it!
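
To make the "md is a sibling to indexed" idea concrete, here is a small sketch under an assumed, simplified schema (a stat table with only stage, path, and mtime). The queries are illustrative and are not the ones conda-index runs. Rows in fs are compared with indexed to find work to do, while md rows skip that step and are read directly alongside indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT, path TEXT, mtime INTEGER, PRIMARY KEY (stage, path))"
)
conn.executemany(
    "INSERT INTO stat VALUES (?, ?, ?)",
    [
        ("fs", "noarch/a-1.0-0.conda", 1),  # on disk and already indexed
        ("indexed", "noarch/a-1.0-0.conda", 1),
        ("fs", "noarch/c-1.0-0.conda", 1),  # on disk, not yet indexed
        ("md", "noarch/b-1.0-py3-none-any.whl", 1),  # metadata only
    ],
)

# Packages whose metadata still needs extracting: present in fs with no
# matching indexed row. md rows never take part in this join.
needs_extract = [
    row[0]
    for row in conn.execute(
        """
        SELECT fs.path FROM stat AS fs
        LEFT JOIN stat AS indexed
          ON indexed.stage = 'indexed'
         AND indexed.path = fs.path
         AND indexed.mtime = fs.mtime
        WHERE fs.stage = 'fs' AND indexed.path IS NULL
        """
    )
]

# Packages considered when pulling data out of the database: indexed plus md.
in_output = [
    row[0]
    for row in conn.execute(
        "SELECT path FROM stat WHERE stage IN ('indexed', 'md') ORDER BY path"
    )
]
```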

@soapy1 soapy1 force-pushed the cache-keep-around branch from bf866b1 to 55367f2 Compare May 4, 2026 17:47
@soapy1 soapy1 changed the title experiment: Add metadata only stage to repodata generation Add metadata only stage to repodata generation May 4, 2026
Comment thread conda_index/index/sqlitecache.py Outdated
@soapy1 soapy1 force-pushed the cache-keep-around branch 4 times, most recently from d243c9f to 88fa138 Compare May 5, 2026 18:11
@soapy1 soapy1 force-pushed the cache-keep-around branch from 88fa138 to c538a50 Compare May 5, 2026 18:12
@soapy1 soapy1 force-pushed the cache-keep-around branch from 802ef0d to 1c01b91 Compare May 5, 2026 20:28
stat["size"],
stat["mtime"],
{},
stat["repodata"],
Author

This means that when a user is creating a function like listdir_like they need to also add this repodata key. That isn't very listdir_stat like.....

For example, something like

def listdir_like():
    for path, repodata in wheels.items():
        assert "sha256" in repodata
        if "md5" not in repodata:
            repodata["md5"] = None
        yield {
            "path": cache.database_path(path),
            "size": repodata["size"],
            "mtime": repodata.get("timestamp", 1),
            "repodata": repodata,
        }

This is maybe a bit too much of a hack. @dholth what do you think?
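
For anyone who wants to try the shape of these entries outside the test suite, here is a self-contained variant of the snippet above. The wheels dict is made-up data, and database_path is a hypothetical stand-in for cache.database_path with an assumed "noarch/" prefix; neither is taken from the PR.

```python
from typing import Iterator


def database_path(path: str, prefix: str = "noarch/") -> str:
    """Hypothetical stand-in for cache.database_path (prefix format is assumed)."""
    return path if path.startswith(prefix) else prefix + path


def listdir_like(wheels: dict) -> Iterator[dict]:
    """Yield listdir_stat-style entries, each carrying an extra "repodata" key."""
    for path, repodata in wheels.items():
        if "sha256" not in repodata:
            raise ValueError(f"{path} has no sha256")
        record = dict(repodata)  # copy so the caller's dict is not mutated
        record.setdefault("md5", None)
        yield {
            "path": database_path(path),
            "size": record["size"],
            "mtime": record.get("timestamp", 1),
            "repodata": record,
        }


# Made-up wheel metadata, for illustration only.
wheels = {
    "example-1.0-py3-none-any.whl": {
        "name": "example",
        "version": "1.0",
        "size": 1234,
        "sha256": "0" * 64,
    }
}
entries = list(listdir_like(wheels))
```

Unlike the original snippet, this variant copies each repodata dict before filling in md5, so the input is left untouched.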

@soapy1 soapy1 force-pushed the cache-keep-around branch from 37d0d0a to d72e5fe Compare May 5, 2026 20:44
Comment thread pyproject.toml Outdated
"msgpack",
"psycopg2",
"ruamel.yaml",
"sqlalchemy",
Author

hmmm, I expect these dependencies were not included for some reason. After fixing the psql cache import errors in the tests, I got new errors saying that these dependencies are missing. Is there a strong incentive to keep these out of the main package? An alternative is to add them as optional dependencies.

Contributor

dholth commented May 5, 2026

This program is written to work with "low dependencies", so it's necessary for sqlalchemy to be only an optional dependency for conda-index. This should help users who install conda-index alongside other programs or incidentally as a conda, conda-pypi, or conda-build dependency.
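
One common way to keep a dependency optional is to import it only where it is needed and fail with a clear message otherwise. The sketch below shows that general pattern. The helper name require_optional is hypothetical and this is not how conda-index is known to handle sqlalchemy. The demo uses json and a deliberately missing module name so that it runs without any extra packages installed.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str) -> ModuleType:
    """Import an optional dependency, or raise an error that names it.

    Hypothetical helper illustrating the lazy-import pattern.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is an optional dependency; install it to use this feature"
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json")

# A module that is absent produces a readable error instead of a bare ImportError.
try:
    require_optional("surely_missing_module_for_demo")
except RuntimeError as exc:
    message = str(exc)
```

With this pattern the core package imports cleanly, and only code paths that actually need the extra module ask for it.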

@soapy1 soapy1 force-pushed the cache-keep-around branch from b55eae9 to bdd5b90 Compare May 5, 2026 21:14
Comment thread tests/conftest.py Outdated
@soapy1 soapy1 force-pushed the cache-keep-around branch 3 times, most recently from 5fb569e to 0a51a31 Compare May 5, 2026 22:10
@soapy1 soapy1 force-pushed the cache-keep-around branch from 0a51a31 to 365ee48 Compare May 5, 2026 22:19
@soapy1 soapy1 marked this pull request as ready for review May 5, 2026 23:39
@soapy1 soapy1 moved this from In Progress 🏗️ to In review 🔍 in conda Roadmap and Sprint Planning May 5, 2026
@soapy1 soapy1 requested a review from dholth May 20, 2026 17:58
Labels

cla-signed [bot] added once the contributor has signed the CLA

Projects

Status: In review 🔍
Status: 🆕 New

4 participants