[DRAFT PROPOSAL DO NOT MERGE] draft pystac replacement #339

cmwilhelm · 2025-10-18T18:04:18Z

In a previous PR I tried to address a memory leak originating in
the pystac/pystac_client libraries. We use these libraries to retrieve
data from Planetary Computer (and earthdaily, actually), both of which implement
the STAC standard.

As discussed in that PR, the memory "leak" is due to the library maintaining
a huge object graph/cache to aid in catalog and collection traversal over the lifetime
of a process. You can't opt out of this or bound it. It's also something we
don't need. My workaround was to kill and restart the client periodically,
and try to wipe all its entangled objects so they can be garbage collected.
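
For readers who have not seen the earlier PR, the restart-and-collect workaround can be sketched roughly as below. This is a minimal illustration, not the code from that PR: `RecyclingClient`, `factory`, and `max_uses` are hypothetical names, and the wrapper is deliberately generic so it does not depend on pystac_client itself.

```python
import gc
from typing import Any, Callable


class RecyclingClient:
    """Recreate a wrapped client every `max_uses` calls (hypothetical helper)."""

    def __init__(self, factory: Callable[[], Any], max_uses: int = 100) -> None:
        self._factory = factory
        self._max_uses = max_uses
        self._client = factory()
        self._uses = 0

    def get(self) -> Any:
        """Return a live client, replacing it once the use budget is spent."""
        if self._uses >= self._max_uses:
            # Drop our only reference, then force a full collection so cyclic
            # garbage left behind by the old client is reclaimed promptly.
            self._client = None
            gc.collect()
            self._client = self._factory()
            self._uses = 0
        self._uses += 1
        return self._client
```

In practice the factory would presumably be something like `lambda: pystac_client.Client.open(url)`, and callers must not hold on to objects returned by the old client, or the old graph stays reachable.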

This seems to work, but I have a few thoughts:

1. The memory footprint is still pretty big because the GC takes a while
   to be able to clear those circular object graphs.
2. Our dataset build nodes in Run are super underutilizing CPU. We could
   easily spawn tons more worker processes -- IF we can handle the memory footprint.
3. The way we use the pystac client is EXTREMELY light.
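
To illustrate point (1), here is a small CPython sketch of why circular graphs linger: reference counting alone cannot free a cycle, so the objects survive until the cyclic collector runs. `Node` is a hypothetical stand-in for a parent/child pair of library objects, not a pystac class.

```python
import gc
import weakref


class Node:
    """Stand-in for a library object that links back to its parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def cycle_survives_del() -> tuple[bool, bool]:
    """Return (alive_after_del, alive_after_collect) for a parent/child cycle."""
    gc.disable()  # keep the demo deterministic; automatic GC would blur the result
    try:
        root = Node()
        Node(parent=root)  # root -> child -> root forms a reference cycle
        probe = weakref.ref(root)
        del root  # the child still points at root, so its refcount stays above zero
        alive_after_del = probe() is not None
        gc.collect()  # the cyclic collector finds and frees the unreachable pair
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        gc.enable()
```

With automatic collection enabled the cycle is eventually freed, but only when the collector's thresholds trigger a pass over the relevant generation, which is why peak memory stays high in the meantime.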

On (3) in particular-- we are only doing some very simple searches to retrieve
items and assets, meaning the thousands and thousands of lines of catalog
traversal code (and data structures) go unused. All the endpoints we hit are STAC 1.0,
meaning all the complicated normalization + data migration code that is implemented
in pystac item deserializers is unnecessary. As far as the in-memory cache goes --
our data source classes are actually caching the data they need to disk and short-circuiting
calls through to the client anyway!

All that's really left is a pretty simple HTTP client that can handle structured search
over paginated results, and then deserialize the responses into something familiar
that our data sources can deal with.
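
As a rough sketch of what "pretty simple" could mean here (this is not the file in this PR): the STAC API item search returns a GeoJSON FeatureCollection whose `links` may contain a `rel="next"` entry, optionally with `method`, `body`, and `merge` fields for POST pagination. The names `Item`, `http_fetch`, `parse_item`, and `search` are hypothetical, and the `Item` fields are an assumption about what a data source needs.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional

JSON = dict[str, Any]
Fetch = Callable[[str, Optional[JSON]], JSON]


@dataclass(frozen=True)
class Item:
    """Minimal view of a STAC item: only fields a data source might read."""
    id: str
    collection: Optional[str]
    properties: JSON
    assets: dict[str, str]  # asset key -> href


def http_fetch(url: str, body: Optional[JSON]) -> JSON:
    """POST `body` as JSON when given, otherwise GET; return the decoded JSON."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Accept": "application/geo+json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def parse_item(feature: JSON) -> Item:
    """Flatten one GeoJSON feature into an Item; no migration or normalization."""
    assets = {key: asset["href"] for key, asset in feature.get("assets", {}).items()}
    return Item(
        id=feature["id"],
        collection=feature.get("collection"),
        properties=feature.get("properties", {}),
        assets=assets,
    )


def search(search_url: str, query: JSON, fetch: Fetch = http_fetch) -> Iterator[Item]:
    """Yield items page by page, following rel="next" links until none remain."""
    url = search_url
    body: Optional[JSON] = dict(query)
    while True:
        page = fetch(url, body)
        for feature in page.get("features", []):
            yield parse_item(feature)
        links = page.get("links", [])
        next_link = next((link for link in links if link.get("rel") == "next"), None)
        if next_link is None:
            return
        url = next_link["href"]
        if next_link.get("method", "GET").upper() == "POST":
            next_body = next_link.get("body", {})
            # With merge=true the link body is layered over the previous request body.
            body = {**body, **next_body} if next_link.get("merge") and body else next_body
        else:
            body = None  # GET pagination carries its state in the href
```

Because `search` is a generator and keeps no registry of what it has yielded, nothing accumulates across pages unless the caller holds on to it. Passing `fetch` as a parameter also leaves an obvious seam for retries, instrumentation, or tests.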

This PR is just a draft. I asked Cursor to lift the salient parts out of the pystac
client and reduce to the bare essentials. I'm wondering what we think about getting
rid of the dependency on pystac/pystac-client and just maintaining this (relatively)
small utility file of our own. It does only what we need to do, doesn't have the memleak,
would be easy for us to instrument ourselves, etc. There's obviously a maintenance
burden here that we outsource with the use of a 3rd party client lib, but I'm offering
this as food for thought.

hunterp · 2025-10-18T19:49:48Z

FWIW I would be interested in having Cursor/Claude take a pass at the inverse of this: updating the pystac library to either let us set a maxsize of that cache or let the user configure the caching behavior. As you said, pystac is kind of the only game in town, but we're somewhat in that game now. We (the ES team) know most of the other players in this open source geospatial space now, and they are keen to work with us. I'd be very interested in us actually being opinionated in the space vs just building our own custom stuff here.

I'm not opposed to this approach, but I'd like to at least explore the other way.

cmwilhelm · 2025-10-18T22:28:06Z

I’ve simplified it by calling it a cache. It’s that plus an object graph where all the nodes in the cache know about each other and, I suspect, about things outside the cache. From what I’ve seen, the cache is just an easy access point to the object graph rather than the only one. These sticky circular references, which are pretty core to the library design (all deserialized objects self-register), make garbage collection super rough. I’m probably not going to look at this in earnest until Monday, but I think it would PROBABLY be tractable to completely disable caching/graph registration. It would probably be really hard to bound its growth (and make the ejected components actually collectible). We could look at alternative caching, perhaps an LRU cache on raw dicts by id rather than a materialized graph, but that would be a larger departure from the library design.
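
A minimal sketch of the "LRU on raw dicts by id" alternative, assuming items are kept as plain JSON dicts with no back-references. `ItemLRU` is a hypothetical name and nothing like it exists in pystac today.

```python
from collections import OrderedDict
from typing import Any, Optional


class ItemLRU:
    """Bounded LRU cache of raw item dicts keyed by id (hypothetical helper)."""

    def __init__(self, maxsize: int) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self._maxsize = maxsize
        self._items = OrderedDict()  # id -> raw dict, oldest entry first

    def get(self, item_id: str) -> Optional[dict[str, Any]]:
        """Return the cached dict, marking it as most recently used."""
        item = self._items.get(item_id)
        if item is not None:
            self._items.move_to_end(item_id)
        return item

    def put(self, item_id: str, item: dict[str, Any]) -> None:
        """Insert or refresh an entry, evicting the least recently used one."""
        self._items[item_id] = item
        self._items.move_to_end(item_id)
        if len(self._items) > self._maxsize:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

Because the cached values are plain dicts that hold no references to each other or to the cache, an evicted entry is freed by reference counting as soon as the caller lets go of it, without waiting for the cyclic collector.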


draft pystac replacement

e9b3e02
