
Importer is becoming slower when geoserver catalog becomes big #285

Open
pchevali opened this issue Feb 19, 2025 · 1 comment

Comments

@pchevali (Contributor)

Hi,

I know this importer is being integrated into GeoNode itself, but I think the problem is the same there: I noticed that the import process gets slower and slower the more data there is in the GeoServer catalog (with my setup, roughly 1000 layers, each layer import takes about 60 seconds; no SSD).
From the GeoServer log I can see that a request is made against each store in the workspace.

I think this is due to the sanity_check phase of the publisher, since it makes requests across all stores of GeoServer, with all possible names, to check the SRID.

https://github.com/GeoNode/geonode-importer/blob/master/importer/publisher.py#L75

Since publish_resources calls the geoserver-restconfig function create_coveragestore, which returns the created resource, maybe we can avoid the sanity check over all the stores?

https://github.com/GeoNode/geoserver-restconfig/blob/ffbcbb175e9df37dbbd4bf0240058d79fb92eca0/src/geoserver/catalog.py#L689
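To illustrate why the per-layer cost grows with the catalog, here is a minimal simulation (the class and method names are hypothetical stand-ins, not the real geoserver-restconfig API): a lookup without a store scans every store, so each layer import costs on the order of one REST request per store in the catalog.

```python
# Minimal simulation of why per-layer lookups slow down as the catalog grows.
# FakeCatalog stands in for the GeoServer REST catalog; all names here are
# hypothetical, chosen only to illustrate the request counts.

class FakeCatalog:
    def __init__(self, n_stores):
        self.stores = {f"store_{i}": f"resource_{i}" for i in range(n_stores)}
        self.requests = 0  # count of simulated REST calls

    def get_resources(self, store=None):
        if store is not None:
            self.requests += 1  # one request against a single store
            return [self.stores[store]]
        # No store given: one request per store, as in the reported behaviour.
        results = []
        for name in self.stores:
            self.requests += 1
            results.append(self.stores[name])
        return results

catalog = FakeCatalog(n_stores=1000)
catalog.get_resources()                  # full scan over the catalog
full_scan = catalog.requests
catalog.requests = 0
catalog.get_resources(store="store_42")  # targeted lookup
targeted = catalog.requests
print(full_scan, targeted)  # 1000 requests vs 1
```

With 1000 stores, every store-less lookup costs 1000 requests, which matches the slowdown described above.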

@pchevali (Contributor, Author)

I tested by removing the sanity checks: the import time is halved. But I still see a request against every store in the GeoServer logs...

I tracked it down to create_geonode_resource, which calls (geonode) resource_manager.create -> (geonode) sync_instance_with_geoserver -> (geonode) fetch_gs_resource -> (geoserver-restconfig) get_resource -> (geoserver-restconfig) get_resources.

The get_resources function is costly when no store is given, as it loops over every store in GeoServer. In my case I have 1000 raster images, so there are at least 1000 requests each time get_resources is called without a store.

So when uploading a raster there are 3 calls to geoserver-restconfig's get_resources() without a store.
I think the second one is cached, so we effectively loop over all the stores twice.

geoserver-restconfig makes the get_resources call without passing a store, even though it has one available as a parameter.
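A sketch of the fix described above, under the assumption that the catalog lookup accepts an optional store argument (the real geoserver-restconfig signatures are not verified here): forward the store when the caller knows it, instead of dropping it and falling back to the full scan.

```python
# Sketch, not the actual geoserver-restconfig code: forward the store through
# to the lookup so it does not fall back to scanning every store. `catalog` is
# any object exposing get_resource(name, store=...); the signature is assumed.

def fetch_resource(catalog, name, store=None):
    if store is not None:
        # The importer knows the store after create_coveragestore returns,
        # so the targeted lookup is always possible on this path.
        return catalog.get_resource(name, store=store)
    # Fallback: the slow path that loops over all stores.
    return catalog.get_resource(name)

class StubCatalog:
    """Stand-in catalog that just echoes what it was asked for."""
    def get_resource(self, name, store=None):
        return (name, store)

print(fetch_resource(StubCatalog(), "layer1", store="store1"))
```

The key point is only that the store parameter is threaded through every call in the chain instead of being discarded at the first hop.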

My path to faster import would be:

  1. Modify geoserver-restconfig's create_coveragestore to call get_resources with a store.

  2. Use the store for the sanity checks in geonode-importer; there is already a function get_geoserver_store_name that returns the store name based on the resource name, so we can use it.

  3. In create_geonode_resource, create the GeoNode resource with the store as a parameter in the defaults dict:
    https://github.com/GeoNode/geonode-importer/blob/master/importer/handlers/common/raster.py#L338

Am I missing something here ?
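Step 2 above could look roughly like this (a sketch under assumptions, not the actual importer code: `store_name_for` stands in for the existing get_geoserver_store_name helper, whose exact signature is assumed, and the stub catalog mimics a targeted get_resource call):

```python
from types import SimpleNamespace

# Hypothetical sketch of a store-scoped sanity check: derive the store from
# the resource name, then make one targeted request instead of one per store.

def sanity_check(catalog, layer_name, store_name_for):
    store_name = store_name_for(layer_name)  # e.g. via get_geoserver_store_name
    resource = catalog.get_resource(layer_name, store=store_name)  # one request
    return resource is not None and resource.projection is not None

class StubCatalog:
    """Stand-in for the GeoServer catalog during this illustration."""
    def get_resource(self, name, store=None):
        # Pretend the store holds a resource with a known projection (SRID).
        return SimpleNamespace(name=name, store=store, projection="EPSG:4326")

print(sanity_check(StubCatalog(), "dem_layer", lambda name: name))
```

If the resource returned by create_coveragestore is reused directly, even this single request might be avoidable.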
