refactor: Introduce new Apify storage client #470

vdusek · 2025-05-10T09:22:35Z

Description

Integration of the Crawlee v1 changes, mostly new storages & storage clients (introduced in refactor!: Introduce new storage client system crawlee-python#1194).

Issues

Closes: Introduce new Apify storage client #469

Testing

The current test set covers the changes.

src/apify/apify_storage_client/_dataset_client.py

src/apify/apify_storage_client/__init__.py

janbuchar · 2025-06-26T14:19:49Z

docs/03_concepts/code/03_dataset_exports.py

@@ -11,14 +11,14 @@ async def main() -> None:
        await dataset.export_to(
            content_type='csv',
            key='data.csv',
-            to_key_value_store_name='my-cool-key-value-store',
+            to_kvs_name='my-cool-key-value-store',


Is this BC break worth it?

let's evaluate all the potential BCs at the end

Sure. I thought we are nearing that now 😁

Since we're just re-exporting the storages from crawlee here, there will be many more cases than this one. I'm not saying we have to rename this particular argument (and I will undo it if you insist—just I don't like those long identifiers, especially when we can use the KVS abbreviation).

src/apify/storage_clients/_apify/_dataset_client.py

src/apify/_actor.py

tests/integration/conftest.py

vdusek · 2025-07-02T13:21:26Z

Probably let's wait for @janbuchar, and also I'll add an upgrading guide for v2.

janbuchar

I haven't made it through the request queue client, but here's a batch of comments.

janbuchar · 2025-07-15T20:15:07Z

src/apify/storage_clients/_file_system/_key_value_store_client.py

+    """
+
+    @override
+    async def purge(self) -> None:


Does this override mean that Crawlee does not treat INPUT.json differently from other files anymore and this logic is only added in the SDK? I'm a fan, but I wonder if the rest of the team is on board with this.

Exactly - because Crawlee does not use INPUT.json for anything. We are in the same boat with @Pijukatel here.

@B4nan, would you be OK with this change? TLDR; so you don't have to study the code: As Honza wrote, Crawlee now treats INPUT.json just like any other file. Here in SDK, I've added a minimal override of FileSystemStorageClient that only customizes the purge method, preventing it from deleting INPUT.json. That's it.

are we talking about the input.json in the project root? since that is implemented on crawlee level (in JS) and i think it should stay this way. i am not talking about the purging logic, in JS, if there is an INPUT.json file in the root, KVS.getInput() will prefer it over anything in the storage folder (the local/memory implementation, with apify client it is ignored completely). not sure if we even do this in the python version (and I am open to discuss if we want to remove this from v4, but it feels like a good thing to me)

if this is just about purging, i am not entirely sure i understand what it means, we dont want to autopurge inputs on crawlee level, crawlee shouldnt behave differently because you init the SDK (not locally).

this is the part where we handle input file in the project root in JS

https://github.com/apify/crawlee/blob/5c4726df96e358a9bbf44a0cd2760e4e269f0fae/packages/core/src/storages/key_value_store.ts#L697

From the "Crawlee is a web scraping framework" point of view, handling input feels a bit redundant here - if you intend to only run the crawler outside of Apify, you'll probably want to accept input some other way - CLI arguments, environment variables, some domain-speciffic config file... Providing a helper for loading a JSON file from two possible locations doesn't add much value IMO.

Asking Apifiers about this doesn't make much sense in this situation, because basically everything we write is intended to run on Apify 🙂

src/apify/storage_clients/_apify/_key_value_store_client.py

Pijukatel

While testing the change on platform I used modified benchmark tests. They are specific, because they scrape locally hosted site and so the speed of scraping is unusually fast - no network related delays. This exposes RequestQueue race condition that can happen when checking if the RQ is empty or not.
When running the test using this branch, I saw that the crawler premature finished even though RQ was still quite full. The hypothesis about premature exit was confirmed when running same test on branch modified with extra log and sleep.

Please see following tiny commit based on this branch and attached log:
5b25728

From this run: https://console.apify.com/actors/UdJ0NQp2j8140G9db/runs/4yfZqLvLo1xhrk2Cb#log

src/apify/storage_clients/_apify/_request_queue_client.py

vdusek self-assigned this May 10, 2025

github-actions bot added this to the 114th sprint - Tooling team milestone May 10, 2025

github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 10, 2025

vdusek changed the title ~~New apify storage clients~~ refactor: Introduce new Apify storage client May 10, 2025

vdusek mentioned this pull request May 10, 2025

refactor!: Introduce new storage client system apify/crawlee-python#1194

Merged

1 task

janbuchar reviewed May 16, 2025

View reviewed changes

src/apify/apify_storage_client/_dataset_client.py Outdated Show resolved Hide resolved

vdusek force-pushed the new-apify-storage-clients branch from d27c080 to 82efd3e Compare June 12, 2025 12:44

github-actions bot added the tested Temporary label used only programatically for some analytics. label Jun 18, 2025

vdusek force-pushed the new-apify-storage-clients branch 2 times, most recently from 067b793 to 104a168 Compare June 23, 2025 09:12

vdusek added 6 commits June 26, 2025 08:22

Rm old Apify storage clients

5c437c9

Add init version of new Apify storage clients

bf55338

Move specific models from Crawlee to SDK

6b2f82b

Adapt to Crawlee v1

38bef68

Adapt to Crawlee v1 (p2)

1f85430

Fix default storage IDs

a3d68a2

vdusek force-pushed the new-apify-storage-clients branch from dc7f0a7 to a3d68a2 Compare June 26, 2025 06:25

vdusek modified the milestones: 114th sprint - Tooling team, 117th sprint - Tooling team Jun 26, 2025

vdusek added 3 commits June 26, 2025 10:42

Fix integration test and Not implemented exception in purge

c77e8d5

Fix unit tests

8731aff

fix lint

8dfaffb

vdusek marked this pull request as ready for review June 26, 2025 13:04

vdusek requested a review from Pijukatel June 26, 2025 13:05

janbuchar self-requested a review June 26, 2025 13:27

vdusek added 2 commits June 26, 2025 16:40

add KVS record_exists not implemented

53fad07

update to apify client 1.12 and implement record exists

5869f8e

janbuchar reviewed Jun 26, 2025

View reviewed changes

vdusek added 2 commits June 27, 2025 08:43

Move default storage IDs to Configuration

82e65fc

opening storages get default id from config

8de950b

vdusek requested a review from Pijukatel July 2, 2025 10:54

Pijukatel approved these changes Jul 2, 2025

View reviewed changes

vdusek and others added 7 commits July 3, 2025 16:08

Merge branch 'master' into new-apify-storage-clients

6fe9eb3

fixes after merge commit

1547cbd

Merge branch 'master' into new-apify-storage-clients

bb47efc

Change from orphan commit to master in crawlee version

4e4fa93

Merge branch 'master' into new-apify-storage-clients

683cb31

fix encrypted secrets test

e5b2bc4

Add Apify's version of FS client that keeps the INPUT json

638756f

janbuchar reviewed Jul 15, 2025

View reviewed changes

vdusek added 4 commits July 16, 2025 09:31

update metadata fixes

931b0ca

Merge branch 'master' into new-apify-storage-clients

ad7c0d8

KVS metadata extended model

1f3c481

fix url signing secret key

44d8e09

Pijukatel self-requested a review July 17, 2025 13:23

Pijukatel requested changes Jul 17, 2025

View reviewed changes

Apify storage client fixes and new docs groups

ca72313

vdusek force-pushed the new-apify-storage-clients branch from 2f39b35 to ca72313 Compare July 20, 2025 18:49

Pijukatel and others added 4 commits July 21, 2025 12:37

Add test for RequestQueue.is_finished

bc61fee

Check _queue_has_locked_requests in is_empty

16b76dd

Merge branch 'master' into new-apify-storage-clients

b6e8a5f

Package structure update

a3f8c6e

Pijukatel reviewed Jul 22, 2025

View reviewed changes

src/apify/storage_clients/_apify/_request_queue_client.py Outdated Show resolved Hide resolved

Pijukatel self-requested a review July 22, 2025 13:45

Pijukatel reviewed Jul 22, 2025

View reviewed changes

src/apify/storage_clients/_apify/_request_queue_client.py Outdated Show resolved Hide resolved

vdusek added 2 commits July 22, 2025 16:27

Fix request list (HttpResponse.read is now async)

594a8e5

init upgrading guide to v3

e1afe2d

Pijukatel self-requested a review July 24, 2025 14:43

addres RQ feedback from Pepa

8ce6902

refactor: Introduce new Apify storage client #470

Are you sure you want to change the base?

refactor: Introduce new Apify storage client #470

Conversation

vdusek commented May 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Issues

Testing

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vdusek commented Jul 2, 2025

Uh oh!

janbuchar left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Pijukatel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vdusek commented May 10, 2025 •

edited

Loading

Pijukatel left a comment •

edited

Loading