[DRAFT PROPOSAL DO NOT MERGE] draft pystac replacement #339
In a previous PR I tried to address a memory leak originating in
the pystac/pystac-client libraries. We use these libraries to retrieve
data from Planetary Computer (and EarthDaily, actually), both of which implement
the STAC standard.
As discussed in that PR, the memory "leak" is due to the library maintaining
a huge object graph/cache to aid in catalog and collection traversal over the lifetime
of a process. You can't opt out of this or bound it. It's also something we
don't need. My workaround was to kill and restart the client periodically,
and try to wipe all its entangled objects so they can be garbage collected.
This seems to work, but I have a few thoughts:
to be able to clear those circular object graphs.
easily spawn tons more worker processes -- IF we can handle the memory footprint.
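For reference, the periodic-restart workaround described above has roughly the following shape. This is a minimal sketch, not the code from the previous PR: `RecyclingClient`, `factory`, and `max_uses` are hypothetical names, and it assumes callers do not hold on to objects returned by the old client, since otherwise the old graph stays reachable.

```python
import gc
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RecyclingClient(Generic[T]):
    """Hands out a client and rebuilds it after a fixed number of uses.

    Hypothetical helper: `factory` is whatever callable builds the real
    client (for example, something that opens a STAC catalog).
    """

    def __init__(self, factory: Callable[[], T], max_uses: int = 100) -> None:
        if max_uses < 1:
            raise ValueError("max_uses must be at least 1")
        self._factory = factory
        self._max_uses = max_uses
        self._client: Optional[T] = None
        self._uses = 0

    def get(self) -> T:
        if self._client is None or self._uses >= self._max_uses:
            # Drop our reference first so the old object graph is unreachable
            # from here, then ask the collector to break reference cycles.
            self._client = None
            gc.collect()
            self._client = self._factory()
            self._uses = 0
        self._uses += 1
        return self._client
```

The explicit `gc.collect()` matters because circular object graphs are not freed by reference counting alone; they wait for the cycle collector.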
On (3) in particular: we are only doing some very simple searches to retrieve
items and assets, meaning the thousands and thousands of lines of catalog
traversal code (and data structures) go unused. All the endpoints we hit are STAC 1.0,
meaning all the complicated normalization + data migration code that is implemented
in pystac item deserializers is unnecessary. As far as the in-memory cache goes --
our data source classes are actually caching the data they need to disk and short-circuiting
calls through to the client anyway!
All that's really left is a pretty simple HTTP client that can handle structured search
over paginated results, and then deserialize the responses into something familiar
that our data sources can deal with.
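To make the scope concrete, here is a minimal sketch of what such a client could look like. It is not the file in this PR. It assumes a STAC API item-search endpoint that accepts `POST` with a JSON body and pages through results with a `rel="next"` link carrying `method`, `body`, and `merge` fields, as the STAC API spec describes. The names `Item`, `search_items`, `http_post_json`, and `max_pages` are hypothetical, and `GET`-style next links are deliberately not handled.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, Optional

JsonDict = Dict[str, Any]
PostJson = Callable[[str, JsonDict], JsonDict]


def http_post_json(url: str, body: JsonDict) -> JsonDict:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


@dataclass
class Item:
    """Only the fields a simple data source needs; no parent/child links."""

    id: str
    collection: Optional[str]
    properties: JsonDict
    assets: Dict[str, str]  # asset key -> href

    @classmethod
    def from_dict(cls, feature: JsonDict) -> "Item":
        assets = {key: a["href"] for key, a in feature.get("assets", {}).items()}
        return cls(
            id=feature["id"],
            collection=feature.get("collection"),
            properties=feature.get("properties", {}),
            assets=assets,
        )


def search_items(
    search_url: str,
    query: JsonDict,
    post_json: PostJson = http_post_json,
    max_pages: int = 1000,
) -> Iterator[Item]:
    """Yield items page by page; nothing is retained between pages."""
    url, body = search_url, dict(query)
    for _ in range(max_pages):
        page = post_json(url, body)
        for feature in page.get("features", []):
            yield Item.from_dict(feature)
        next_link = next(
            (link for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        if next_link is None:
            return
        # Assumption: the server paginates via POST; GET links are out of scope.
        if next_link.get("method", "GET").upper() != "POST":
            raise NotImplementedError("only POST next links are handled")
        url = next_link.get("href", url)
        next_body = next_link.get("body") or {}
        # With merge=true the link body is overlaid on the previous request body.
        body = {**body, **next_body} if next_link.get("merge") else next_body
```

Because `search_items` is a generator and `Item` holds no references back to a catalog or collection, each page can be garbage collected as soon as the caller is done with it. Passing `post_json` in as a parameter also gives us an obvious place to add retries, timing, or logging.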
This PR is just a draft. I asked Cursor to lift the salient parts out of the pystac
client and reduce them to the bare essentials. I'm wondering what we think about getting
rid of the dependency on pystac/pystac-client and just maintaining this (relatively)
small utility file of our own. It does only what we need it to do, doesn't have the memory leak,
would be easy for us to instrument ourselves, etc. There's obviously a maintenance
burden here that we currently outsource by using a third-party client lib, but I'm offering
this as food for thought.