@cmwilhelm
Contributor

In a previous PR I tried to address a memory leak originating in
the pystac/pystac_client libraries. We use these libraries to retrieve
data from Planetary Computer (and earthdaily, actually), both of which implement
the STAC standard.

As discussed in that PR, the memory "leak" is due to the library maintaining
a huge object graph/cache to aid in catalog and collection traversal over the lifetime
of a process. You can't opt out of this or bound it. It's also something we
don't need. My workaround was to kill and restart the client periodically,
and try to wipe all its entangled objects so they can be garbage collected.
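For readers who haven't seen the earlier PR, the restart workaround looks roughly like the sketch below. This is a generic illustration and not the code from that PR: `RecyclingClient`, `factory`, and `max_calls` are hypothetical names, and the factory stands in for whatever constructs the real STAC client.

```python
import gc
from typing import Any, Callable


class RecyclingClient:
    """Rebuild a wrapped client every `max_calls` uses (hypothetical helper)."""

    def __init__(self, factory: Callable[[], Any], max_calls: int = 100) -> None:
        self._factory = factory      # assumption: builds a fresh client on each call
        self._max_calls = max_calls  # assumption: tuned by hand per workload
        self._calls = 0
        self._client = factory()

    def get(self) -> Any:
        """Return the current client, replacing it once the call budget is spent."""
        if self._calls >= self._max_calls:
            # Drop our only reference, then ask the cycle collector to reclaim
            # any circular object graph the old client left behind.
            self._client = None
            gc.collect()
            self._client = self._factory()
            self._calls = 0
        self._calls += 1
        return self._client
```

Note that this only helps if nothing else holds a reference to the old client or to objects it handed out; anything still reachable keeps the old graph alive.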

This seems to work, but I have a few thoughts:

  1. The memory footprint is still pretty big because the GC takes a while
    to be able to clear those circular object graphs.
  2. Our dataset build nodes in Run are super underutilizing CPU. We could
    easily spawn tons more worker processes -- IF we can handle the memory footprint.
  3. The way we use the pystac client is EXTREMELY light.

On (3) in particular-- we are only doing some very simple searches to retrieve
items and assets, meaning the thousands and thousands of lines of catalog
traversal code (and data structures) go unused. All the endpoints we hit are STAC 1.0,
meaning all the complicated normalization + data migration code that is implemented
in pystac item deserializers is unnecessary. As far as the in-memory cache goes --
our data source classes are actually caching the data they need to disk and short-circuiting
calls through to the client anyway!

All that's really left is a pretty simple HTTP client that can handle structured search
over paginated results, and then deserialize the responses into something familiar
that our data sources can deal with.
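To make that concrete, here is a minimal sketch of what such a client could look like. It is not the file in this draft: `post_json`, `search_items`, and their parameters are hypothetical names, items are returned as plain dicts, and it assumes the server paginates the way the STAC API item search spec describes for POST (a `rel="next"` link carrying `method`, `body`, and optionally `merge`).

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional

JsonDict = Dict[str, Any]


def post_json(url: str, body: JsonDict) -> JsonDict:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def search_items(
    search_url: str,
    query: JsonDict,
    fetch: Callable[[str, JsonDict], JsonDict] = post_json,
    max_items: Optional[int] = None,
) -> Iterator[JsonDict]:
    """Yield item dicts from a STAC /search endpoint, following "next" links."""
    url, body = search_url, dict(query)
    yielded = 0
    while True:
        page = fetch(url, body)
        for feature in page.get("features", []):
            yield feature
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        next_link = next(
            (link for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        if next_link is None:
            return
        # Assumption: only POST-style pagination is handled in this sketch.
        if next_link.get("method", "GET").upper() != "POST":
            raise NotImplementedError("only POST next links are supported")
        url = next_link.get("href", url)
        next_body = next_link.get("body") or {}
        # With merge=true the link body is merged into the previous request
        # body; otherwise it replaces it.
        body = {**body, **next_body} if next_link.get("merge") else (next_body or body)
```

Because it is a generator that keeps only the current page in memory, nothing accumulates across searches. Passing `fetch` as a parameter also gives us an obvious place to add our own instrumentation or retries.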

This PR is just a draft. I asked Cursor to lift the salient parts out of the pystac
client and reduce to the bare essentials. I'm wondering what we think about getting
rid of the dependency on pystac/pystac-client and just maintaining this (relatively)
small utility file of our own. It does only what we need to do, doesn't have the memleak,
would be easy for us to instrument ourselves, etc. There's obviously a maintenance
burden here that we outsource with the use of a 3rd party client lib, but I'm offering
this as food for thought.

@hunterp

hunterp commented Oct 18, 2025

FWIW I would be interested in having Cursor/Claude take a pass at the inverse of this: updating the pystac library to either let us set a maxsize on that cache or let the user configure the caching behavior. As you said, pystac is kind of the only game in town, but we're somewhat in that game now. We (the ES team) know most of the other players in this open source geospatial space now, and they are keen to work with us. I'd be very interested in us actually being opinionated in the space vs. just building our own custom stuff here.

I'm not opposed to this approach, but I'd like to at least explore the other way.

@cmwilhelm
Contributor Author

cmwilhelm commented Oct 18, 2025 via email


3 participants