diff --git a/.cursor/rules/storage-provider.mdc b/.cursor/rules/storage-provider.mdc index c49f8df39da2..193a711b0a8e 100644 --- a/.cursor/rules/storage-provider.mdc +++ b/.cursor/rules/storage-provider.mdc @@ -2,87 +2,142 @@ description: How to add a new storage or data connector for Label Studio alwaysApply: false --- -# Cursor Rule: Implementing New Storage Providers in Label Studio +# Implementing New Storage Providers in Label Studio ## Overview -This rule describes the process and best practices for adding a new storage provider to Label Studio using the declarative provider schema system. - -See comprehensive overview about storages @io_storages/README.md. - -## Architecture Overview +This document describes the process and best practices for adding a new storage provider to Label Studio using the declarative provider schema system. Label Studio supports 2 types of cloud storages: 1. **Import Storages** (Source Cloud Storages) - for importing tasks/data 2. **Export Storages** (Target Cloud Storages) - for exporting annotations -### Key Differences Between Storage Types - -| Aspect | Import Storage | Export Storage | -|--------|----------------|----------------| -| **Purpose** | Import tasks FROM cloud storage | Export annotations TO cloud storage | -| **Triggering** | Manual sync via API/UI | Automatic via Django signals | -| **Data Flow** | Storage → Label Studio | Label Studio → Storage | -| **Validation** | Must check prefix exists | No prefix check (auto-created) | -| **Primary Methods** | `iter_objects()`, `get_data()` | `save_annotation()`, `save_annotations()` | -| **Threading** | Single-threaded iteration | Multi-threaded export (max_workers) | - -Each storage type follows this inheritance hierarchy: -```mermaid -graph TD - Storage-->ImportStorage - Storage-->ExportStorage - - ProjectStorageMixin-->NewImportStorage - ImportStorage-->NewImportStorageBase - - NewStorageMixin-->NewExportStorage - ExportStorage-->NewExportStorage +See comprehensive overview about storages @io_storages/README.md. - NewImportStorageBase-->NewImportStorage - - subgraph New Provider - NewImportStorage - NewImportStorageBase - NewExportStorage - NewStorageMixin[NewProviderStorageMixin] - end -``` -## Key Export Storage Insights +## Implementation Checklist -Based on the implementation patterns in the codebase, here are the critical aspects specific to export storages (target storages): +Follow all steps below to implement a new storage. More details follow after the checklist; review them all. Do it on your best, until all items are done and tests are passing. + +### 1. Exploration and preparation +1. [ ] Carefully read @io_storages/README.md +2. [ ] Search official documentation for the new storage you want to add + - [ ] Determine whether pre-signed URLs are supported, or whether only direct reads are possible. In case of direct reads, we should hide pre-signed URLs toggle and use Label Studio proxy. + - [ ] Determine whether writes are supported, and how annotations will be stored (objects/blobs, files, rows/strings in a table, etc.) + - [ ] Understand the provider's Python API/SDK, especially how to read, write, and list objects. If SDK is available, use SDK +3. If the requester hasn't specified the target edition, recommend Open Source or Enterprise and confirm the choice +4. Check storage directory structure in `label_studio/io_storages` (or `label_studio_enterprise/lse_io_storages` for Enterprise) and the `s3` (or `s3s` for Enterprise) subfolder +5. 
[ ] Create the new provider directory structure based on the pattern you observed +6. [ ] Create a README.md file in the new provider folder +7. [ ] Add a brief Overview section about the new storage and your findings from step 2 +8. [ ] Add a section on how to configure the storage from scratch for users unfamiliar with it. Provide clear, correct, up-to-date steps with links to official docs to reduce manual QA time + +### 2. Backend Implementation +1. [ ] Implement storage mixin with common fields: + - [ ] Basic fields: bucket, prefix, regex_filter, use_blob_urls (pre-signed URLs on/off), recursive_scan (if applicable) + - [ ] URL resolution: presign, presign_ttl (if applicable to the storage) + - [ ] Provider credentials: api_key, secret_key, endpoint_url + - [ ] Common methods: get_client(), validate_connection() +2. [ ] Create import storage base class with required methods: + - [ ] `iter_objects()` - iterate over storage objects + - [ ] `get_data()` - load task data from objects + - [ ] `generate_http_url()` - create HTTP URLs + - [ ] `can_resolve_url()` - check URL resolution capability + - [ ] `validate_connection()` - validate credentials and that the prefix contains files +3. [ ] Create export storage class with required methods: + - [ ] `save_annotation()` - save single annotation to storage + - [ ] `delete_annotation()` - delete annotation from storage (optional) + - [ ] `validate_connection()` - validate credentials and bucket access (NO prefix check) +4. [ ] Create non-abstract provider-specific concrete classes for import and export +5. [ ] Implement storage link models: + - [ ] ImportStorageLink for tracking task imports + - [ ] ExportStorageLink for tracking annotation exports +6. [ ] **CRITICAL: Add `app_label = 'io_storages'` to Meta classes** - All concrete storage models (ImportStorage, ExportStorage, and StorageLink classes) must include `app_label = 'io_storages'` in their Meta class to avoid Django app registration errors. This is required because storage providers are in subdirectories of `io_storages` but need to be registered under the main `io_storages` app. **Note**: Enterprise providers do NOT need app_label - see enterprise guide. +7. [ ] Create serializers with validation logic +8. [ ] Implement API views following existing patterns +9. [ ] Register URLs in storage URL configuration +10. [ ] Add signal handlers for auto-export functionality: + - [ ] post_save signal for automatic annotation export + - [ ] pre_delete signal for automatic annotation deletion + - [ ] Async export functions with error handling +11. [ ] If you use SDK: add provider SDK library to pyproject.toml + - [ ] Make poetry lock: `poetry install && poetry lock` +12. [ ] Create database migrations using `poetry run python manage.py makemigrations` only! +13. [ ] Ensure that you correctly handle token and security fields; they should not be displayed on the frontend or backend after they are initially entered and saved. Verify how this works with other storage codes. + +### 3. Frontend Implementation +1. [ ] Check examples: for Open Source see: `label-studio/web/apps/labelstudio/src/pages/Settings/StorageSettings/providers/`, for Enterprise see: `label-studio-enterprise/web/apps/labelstudio/src/pages/Settings/StorageSettings/providers/` +2. 
[ ] Create a provider configuration file in `web/apps/labelstudio/src/pages/Settings/StorageSettings/providers/` with: + - [ ] All required fields with proper types + - [ ] Zod validation schemas + - [ ] Meaningful labels and placeholders + - [ ] Proper field layout definition +3. [ ] Register provider in central registry +4. [ ] Mark credential fields with `accessKey: true` +5. [ ] Test form rendering and validation +6. [ ] Verify edit mode behavior for credentials + +### 4. Testing +- [ ] Write backend pytests for newly added API calls (see @backend-unit-tests.mdc for details) +- [ ] Test connection validation (validate_connection) +- [ ] Test object iteration and filtering (iter_objects) +- [ ] Test task data loading (get_data) +- [ ] Test frontend form functionality +- [ ] Test export annotations on Sync button click and when Submit button clicked (post save signal) +- [ ] Test delete exported annotation +- [ ] Critical: run all created tests, check how to run them in @backend-unit-tests.mdc + +### 5. Documentation +- [ ] Add provider to storage documentation (docs/source/guide/storage.md) +- [ ] Update API documentation using `@method_decorator` for storage API classes (see @updating-label-studio-sdk.mdc) + +### 6. Git +- [ ] Commit all added and modified files related to the new storage into git, use `git add ` and never use `git commit -a`. + +### 7. Integration & Deployment +These steps are for manual QA by the requester; remind them after you finish your work: +- [ ] Test end-to-end storage workflow + - [ ] Create a project, add a new import storage, sync it, and check Data Manager for new files + - [ ] Create a project, add a new export storage, create a few annotations, sync the storage, and check that files appear in the storage admin console +- [ ] Test URL resolution: verify that storage URIs like `s3://xxx/1.jpg` are resolved and load in the editor +- [ ] Test with both presigned URLs and proxy mode +- [ ] Test storage error and status reporting: if there are any errors, a user should be able to click "Failed - View logs" and see an error description -### 1. **Automatic vs Manual Operation** -- **Import storages**: Require manual sync via API calls -- **Export storages**: Automatically triggered by Django signals when annotations are created/updated -### 2. **Connection Validation Differences** -- **Import storages**: Must validate that prefix contains files during `validate_connection()` -- **Export storages**: Only validate bucket access, NOT prefix (prefixes are created automatically) +## Decision: Open Source vs Enterprise -### 3. **Data Serialization** -Export storages use `_get_serialized_data()` which returns different formats based on feature flags: -- **Default**: Only annotation data (backward compatibility) -- **With `fflag_feat_optic_650_target_storage_task_format_long`**: Full task + annotations data +**CRITICAL FIRST DECISION**: Where should your new storage provider be implemented? -### 4. **Built-in Threading** -- Export storages inherit `save_annotations()` with built-in parallel processing -- Uses ThreadPoolExecutor with configurable `max_workers` (default: min(8, cpu_count * 4)) -- Includes progress tracking and automatic batch processing +### Add to Open Source (`io_storages`) if: +- Basic authentication (API keys, service accounts) +- Standard file formats (JSON, JSONL, images) +- Community-focused features +- Simple cloud storage connectivity +- User request: the requester explicitly asks for Open Source -### 5. 
**Storage Links & Key Generation** -- **Import links**: Track task imports with custom keys -- **Export links**: Track annotation exports with keys based on feature flags: - - Default: `annotation.id` - - With feature flag: `task.id` + optional `.json` extension +### Add to Enterprise (`lse_io_storages`) if you need: +- **Advanced Authentication**: IAM roles, Workload Identity Federation, cross-account access +- **Enterprise Security**: Server-side encryption, ACL controls, audit logging +- **Advanced Data Formats**: Parquet support, complex data transformations +- **Billing Restrictions**: Storage limits, organizational constraints +- **Advanced Review Workflows**: Review-based export triggers +- **User request**: The requester explicitly asks for Enterprise + +### Key Enterprise Advantages + +1. **No App Label Issues**: LSE uses proper app configuration, avoiding Django registration conflicts +2. **Advanced Authentication**: Support for IAM roles, WIF, cross-account access +3. **Enhanced Security**: Built-in encryption, ACL controls, audit capabilities +4. **Enterprise Features**: Parquet support, billing controls, review workflows +5. **Better Error Handling**: Enhanced logging, metrics, and monitoring +6. **Scalability**: Client caching, optimized performance patterns + +**Default Recommendation**: Most new storage providers should be added to Enterprise for better security, features, and future extensibility. -### 6. **Optional Deletion Support** -- Export storages can implement `delete_annotation()` -- Controlled by `can_delete_objects` field -- Automatically called when annotations are deleted from Label Studio ## Backend Implementation +Important note: if you implement a storage for Label Studio Enterprise, replace all paths `label_studio/io_storages` with `label_studio_enterprise/lse_io_storages`. + ### 1. Create Storage Models #### File Structure @@ -136,7 +191,7 @@ class YourProviderImportStorage(ProjectStorageMixin, YourProviderImportStorageBa #### Export Storage Class -**Reference Implementation**: Follow `io_storages/s3/models.py` `S3ExportStorage` class +**Reference Implementation**: Follow `io_storages/s3/models.py` `S3ExportStorage` class as an example **Required Class**: ```python @@ -230,8 +285,7 @@ Create API views in `label_studio/io_storages/yourprovider/api.py`: **Key Features**: 1. **OpenAPI Documentation**: Use `@method_decorator` with `extend_schema` for each endpoint -2. **Proper Tags**: Use `['Storage: YourProvider']` and `['Export Storage: YourProvider']` -3. **Queryset & Serializer**: Set `queryset` and `serializer_class` for each view +2. **Queryset & Serializer**: Set `queryset` and `serializer_class` for each view ### 4. 
Register URLs @@ -254,9 +308,9 @@ path('api/storages/yourprovider/', include(('io_storages.yourprovider.urls', 'io ### Create Provider Configuration -**Reference Implementation**: Follow `web/lib/app-common/src/blocks/StorageProviderForm/providers/s3.ts` +**Reference Implementation**: Follow `web/apps/labelstudio/src/pages/Settings/StorageSettings/providers/s3.ts` -**Create**: `web/lib/app-common/src/blocks/StorageProviderForm/providers/yourprovider.ts` +**Create**: `web/apps/labelstudio/src/pages/Settings/StorageSettings/providers/yourprovider.ts` **Required Structure**: ```ts @@ -285,7 +339,7 @@ const yourProviderProvider: ProviderConfig = { ## Testing -**Reference Implementation**: Follow `label_studio/io_storages/tests/test_s3.py` patterns +**Reference Implementation**: Follow tests in `label_studio/io_storages/tests/` and `label_studio_enterprise/lse_io_storages/tests/`. Useful examples include `test_import_storage_list_files_api.py`, `test_proxy_api.py`, and `test_get_bytes_stream.py`. **Create**: `label_studio/io_storages/tests/test_yourprovider.py` @@ -300,72 +354,10 @@ const yourProviderProvider: ProviderConfig = { - `test_data_loading()` - Test task data loading from storage objects - `test_export_functionality()` - Test annotation export and deletion -## Implementation Checklist - -### Backend Implementation -- [ ] Create provider directory structure -- [ ] Implement storage mixin with common fields: - - [ ] Basic fields: bucket, prefix, regex_filter, use_blob_urls - - [ ] URL resolution: presign, presign_ttl - - [ ] Provider credentials: api_key, secret_key, endpoint_url - - [ ] Common methods: get_client(), validate_connection() -- [ ] Create import storage base class with required methods: - - [ ] `iter_objects()` - iterate over storage objects - - [ ] `get_data()` - load task data from objects - - [ ] `generate_http_url()` - create HTTP URLs - - [ ] `can_resolve_url()` - check URL resolution capability - - [ ] `validate_connection()` - validate credentials and prefix has files -- [ ] Create export storage class with required methods: - - [ ] `save_annotation()` - save single annotation to storage - - [ ] `delete_annotation()` - delete annotation from storage (optional) - - [ ] `validate_connection()` - validate credentials and bucket access (NO prefix check) -- [ ] Create concrete import/export storage classes -- [ ] Implement storage link models: - - [ ] ImportStorageLink for tracking task imports - - [ ] ExportStorageLink for tracking annotation exports -- [ ] **CRITICAL: Add `app_label = 'io_storages'` to Meta classes** - All concrete storage models (ImportStorage, ExportStorage, and StorageLink classes) must include `app_label = 'io_storages'` in their Meta class to avoid Django app registration errors. This is required because storage providers are in subdirectories of `io_storages` but need to be registered under the main `io_storages` app. **Note**: Enterprise providers do NOT need app_label - see enterprise guide. 
-- [ ] Create serializers with validation logic -- [ ] Implement API views following existing patterns -- [ ] Register URLs in storage URL configuration -- [ ] Add signal handlers for auto-export functionality: - - [ ] post_save signal for automatic annotation export - - [ ] pre_delete signal for automatic annotation deletion - - [ ] Async export functions with error handling -- [ ] Create database migrations -- [ ] Add basic pytests for newly added API calls - -### Frontend Implementation -- [ ] Create provider configuration file with: - - [ ] All required fields with proper types - - [ ] Zod validation schemas - - [ ] Meaningful labels and placeholders - - [ ] Proper field layout definition -- [ ] Register provider in central registry -- [ ] Mark credential fields with `accessKey: true` -- [ ] Test form rendering and validation -- [ ] Verify edit mode behavior for credentials - -### Testing & Documentation -- [ ] Write backend unit tests -- [ ] Test connection validation -- [ ] Test object iteration and filtering -- [ ] Test task data loading -- [ ] Test frontend form functionality -- [ ] Test both create and edit modes -- [ ] Update API documentation -- [ ] Add provider to storage documentation (docs/source/guide/storage.md) - -### Integration & Deployment -- [ ] Test end-to-end storage workflow -- [ ] Verify task import/export functionality -- [ ] Test URL resolution and proxy functionality -- [ ] Test with both presigned URLs and proxy mode -- [ ] Verify error handling and user feedback -- [ ] Test storage sync and status reporting - ## Common Issues & Solutions ### Django App Label Error + **Error**: `RuntimeError: Model class doesn't declare an explicit app_label and isn't in an application in INSTALLED_APPS` **Cause**: Storage provider models in subdirectories (e.g., `io_storages.databricks`) are not automatically recognized as belonging to the `io_storages` app. @@ -394,6 +386,7 @@ class YourProviderImportStorageLink(ImportStorageLink): **Note**: This is required for ALL concrete models (not abstract ones) including storage classes and link models. ### Django Model Conflict Error + **Error**: `RuntimeError: Conflicting 'providernameimportstorage' models in application 'io_storages'` **Cause**: Django is registering the same model through two different import paths: diff --git a/docs/source/guide/storage.md b/docs/source/guide/storage.md index 7627d744a3fa..b18d8877234c 100644 --- a/docs/source/guide/storage.md +++ b/docs/source/guide/storage.md @@ -19,6 +19,8 @@ Set up the following cloud and other storage systems with Label Studio: - [Microsoft Azure Blob storage](#Microsoft-Azure-Blob-storage) - [Redis database](#Redis-database) - [Local storage](#Local-storage)
(for On-prem only)
+- [Databricks Files (UC Volumes)](#Databricks-Files-UC-Volumes) + ## Troubleshooting @@ -43,6 +45,7 @@ For more troubleshooting information, see [Troubleshooting Import, Export, & Sto + ## How external storage connections and sync work You can add source storage connections to sync data from an external source to a Label Studio project, and add target storage connections to sync annotations from Label Studio to external storage. Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project. @@ -1483,7 +1486,86 @@ You can also create a storage connection using the Label Studio API. If you're using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See [Run Label Studio on Docker and use local storage](https://labelstud.io/guide/start#Run-Label-Studio-on-Docker-and-use-Local-Storage). -### Troubleshooting cloud storage + + +## Databricks Files (UC Volumes) + +
+
+Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode (no presigned URLs are supported by Databricks).
+
+### Prerequisites
+- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or Azure domain)
+- A Databricks Personal Access Token (PAT) with permission to access the Files API
+- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label
+
+References:
+- Databricks workspace: https://docs.databricks.com/en/getting-started/index.html
+- Personal access tokens: https://docs.databricks.com/en/dev-tools/auth/pat.html
+- Unity Catalog and Volumes: https://docs.databricks.com/en/files/volumes.html
+
+### Set up the connection in the Label Studio UI
+1. Open Label Studio → project → **Settings > Cloud Storage**.
+2. Click **Add Source Storage**. Select **Databricks Files (UC Volumes)**.
+3. Configure the connection:
+   - Workspace Host: your Databricks workspace base URL (no trailing slash)
+   - Access Token: your PAT
+   - Catalog / Schema / Volume: Unity Catalog coordinates
+   - Click **Next** to open Import Settings & Preview
+4. Import Settings & Preview:
+   - Bucket Prefix (optional): relative subpath under the volume (e.g., `images/train`)
+   - File Name Filter (optional): regex to filter files (e.g., `.*\.json$`)
+   - Scan all sub-folders: enable for recursive listing; disable to list only the current folder
+   - Click **Load preview** to verify files
+5. Click **Save** (or **Save & Sync**) to create the connection and sync tasks.
+
+### Target storage (export)
+1. Open **Settings > Cloud Storage** → **Add Target Storage** → **Databricks Files (UC Volumes)**.
+2. Use the same Workspace Host/Token and UC coordinates.
+3. Set an Export Prefix (e.g., `exports/${project_id}`).
+4. Click **Save** and then **Sync** to push annotations as JSON files to your volume.
+
+!!! note "URI scheme"
+    To reference Databricks files directly in task JSON (without using an Import Storage), use Label Studio’s Databricks URI scheme:
+
+    `dbx://Volumes/<catalog>/<schema>/<volume>/<path>`
+
+    Example:
+
+    ```
+    { "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }
+    ```
+
+!!! note "Troubleshooting"
+    - If listing returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>/<path>` and your PAT permissions.
+    - Ensure the Workspace Host has no trailing slash and matches your workspace domain.
+    - If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio and that network egress allows Label Studio to reach Databricks.
+
+!!! warning "Proxy and security"
+    This connector streams data **through the Label Studio backend** with HTTP Range support. Databricks does not support presigned URLs, so this option is also not available in Label Studio.
+
+
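+### Create the connection via the API (optional)
+
+You can also create and sync the connection programmatically. The exact endpoint path and field names are defined by the provider's serializer and URL registration in the backend, so treat the snippet below as an illustrative sketch (it assumes the provider is registered under the `databricks` slug and that the payload fields mirror the UI form above); check the generated API reference for the exact schema:
+
+```python
+import requests
+
+LABEL_STUDIO_URL = "https://app.humansignal.com"  # your Label Studio host
+API_KEY = "<your-label-studio-access-token>"
+HEADERS = {"Authorization": f"Token {API_KEY}"}
+
+# Hypothetical payload: field names mirror the UI form described above
+payload = {
+    "project": 1,
+    "title": "UC Volume images",
+    "host": "https://adb-12345678901234.1.databricks.com",  # Workspace Host, no trailing slash
+    "token": "<databricks-personal-access-token>",
+    "catalog": "main",
+    "schema": "default",
+    "volume": "dataset",
+    "prefix": "images/train",      # optional subpath under the volume
+    "regex_filter": r".*\.jpg$",   # optional file name filter
+    "use_blob_urls": True,         # one task per file
+}
+
+# Create the source storage (endpoint path assumed from the provider slug)
+resp = requests.post(f"{LABEL_STUDIO_URL}/api/storages/databricks", headers=HEADERS, json=payload)
+resp.raise_for_status()
+storage_id = resp.json()["id"]
+
+# Trigger the first sync to import tasks
+requests.post(f"{LABEL_STUDIO_URL}/api/storages/databricks/{storage_id}/sync", headers=HEADERS).raise_for_status()
+```
+
+Target (export) storage follows the same pattern against the corresponding export endpoint; tasks created by the sync reference files via the `dbx://` URI scheme shown above.
+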
+ +
+
+### Use Databricks Files in Label Studio Enterprise
+
+Databricks Unity Catalog (UC) Volumes integration is available in Label Studio Enterprise. It lets you:
+
+- Import files directly from UC Volumes under `/Volumes/<catalog>/<schema>/<volume>`
+- Stream media securely via the platform proxy (no presigned URLs)
+- Export annotations back to your Databricks Volume as JSON
+
+Learn more and see the full setup guide in the Enterprise documentation: [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes). If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).
+
+ + + +## Troubleshooting cloud storage
diff --git a/label_studio/io_storages/README.md b/label_studio/io_storages/README.md index d234ca216ab4..017673a2791a 100644 --- a/label_studio/io_storages/README.md +++ b/label_studio/io_storages/README.md @@ -1,21 +1,21 @@ # Cloud Storages -There are 3 basic types of cloud storages: +Cloud storage is used for importing tasks and exporting annotations in Label Studio. There are 2 basic types of cloud storages: 1. Import Storages (aka Source Cloud Storages) 2. Export Storages (aka Target Cloud Storages) -3. Dataset Storages (available in enterprise) Also Label Studio has Persistent storages where LS storage export files, user avatars and UI uploads. Do not confuse `Cloud Storages` and `Persistent Storage`, they have completely different codebase and tasks. Cloud Storages are implemented in `io_storages`, Persistent Storage uses django-storages and it is installed in Django settings environment variables (see `base.py`). +Note: Dataset Storages were implemented in the enterprise codebase only. They are **deprecated and not used**. +## Basic hierarchy +This section uses GCS storage as an example, and the same logic can be applied to other storages. -## Basic hierarchy - -### Import and Dataset Storages +### Import Storages -This diagram is based on Google Cloud Storage (GCS) and other storages are implemented the same way. +This storage type is designed for importing tasks FROM cloud storage to Label Studio. This diagram is based on Google Cloud Storage (GCS), and other storages are implemented in the same way: ```mermaid graph TD; @@ -28,7 +28,7 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple GCSImportStorageBase-->GCSImportStorage; GCSImportStorageBase-->GCSDatasetStorage; - DatasetStorageMixin-->GCSDatasetStorage; + GCSImportStorageLink-->ImportStorageLink subgraph Google Cloud Storage GCSImportStorage; @@ -37,7 +37,52 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple end ``` +- **Storage** (`label_studio/io_storages/base_models.py`): Abstract base for all storages. Inherits status/progress from `StorageInfo`. Defines `validate_connection()` contract and common metadata fields. + +- **ImportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for source storages. Defines core contracts used by sync and proxy: + - `iter_objects()`, `iter_keys()` to enumerate objects + - `get_unified_metadata(obj)` to normalize provider metadata + - `get_data(key)` to produce `StorageObject`(s) for task creation + - `generate_http_url(url)` to resolve provider URL -> HTTP URL (presigned or direct) + - `resolve_uri(...)` and `can_resolve_url(...)` used by the Storage Proxy + - `scan_and_create_links()` to create `ImportStorageLink`s for tasks + +- **ImportStorageLink** (`label_studio/io_storages/base_models.py`): Link model created per-task for imported objects. Fields: `task` (1:1), `key` (external key), `row_group`/`row_index` (parquet/JSONL indices), `object_exists`, timestamps. Helpers: `n_tasks_linked(key, storage)` and `create(task, key, storage, row_index=None, row_group=None)`. + +- **ProjectStorageMixin** (`label_studio/io_storages/base_models.py`): Adds `project` FK and permission checks. Used by project-scoped storages (e.g., `GCSImportStorage`). + +- **GCSImportStorageBase** (`label_studio/io_storages/gcs/models.py`): GCS-specific import base. 
Sets `url_scheme='gs'`, implements listing (`iter_objects/iter_keys`), data loading (`get_data`), URL generation (`generate_http_url`), URL resolution checks, and metadata helpers. Reused by both project imports and enterprise datasets. + +- **GCSImportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete project-scoped GCS import storage combining `ProjectStorageMixin` + `GCSImportStorageBase`. +- **GCSImportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific `ImportStorageLink` with `storage` FK to `GCSImportStorage`. Created during sync to associate a task with the original GCS object key. + +### Export Storages + +This storage type is designed for exporting tasks or annotations FROM Label Studio to cloud storage. + +```mermaid + graph TD; + + Storage-->ExportStorage; + + ProjectStorageMixin-->ExportStorage; + ExportStorage-->GCSExportStorage; + GCSStorageMixin-->GCSExportStorage; + + ExportStorageLink-->GCSExportStorageLink; +``` + +- **ExportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for target storages. Project-scoped; orchestrates export jobs and progress. Key methods: + - `save_annotation(annotation)` provider-specific write + - `save_annotations(queryset)`, `save_all_annotations()`, `save_only_new_annotations()` helpers + - `sync(save_only_new_annotations=False)` background export via RQ + +- **GCSExportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete target storage for GCS. Serializes data via `_get_serialized_data(...)`, computes key via `GCSExportStorageLink.get_key(...)`, uploads to GCS; can auto-export on annotation save when configured. + +- **ExportStorageLink** (`label_studio/io_storages/base_models.py`): Base link model connecting exported objects to `Annotation`s. Provides `get_key(annotation)` logic (task-based or annotation-based via FF) and `create(...)` helper. + +- **GCSExportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific link model holding FK to `GCSExportStorage`. ## How validate_connection() works @@ -50,32 +95,43 @@ Run this command with try/except: Target storages use the same validate_connection() function, but without any prefix. -## Google Cloud Storage (GCS) +## Key Storage Insights + +### 1. **Primary Methods** +- **Import storages**: `iter_objects()`, `get_data()` +- **Export storages**: `save_annotation()`, `save_annotations()` + +### 2. **Automatic vs Manual Operation** +- **Import storages**: Require manual sync via API calls or UI +- **Export storages**: Manual sync via API/UI | Manual sync via API/UI and automatic via Django signals when annotations are submitted or updated -### Credentials +### 3. **Connection Validation Differences** +- **Import storages**: Must validate that prefix contains files during `validate_connection()` +- **Export storages**: Only validate bucket access, NOT prefix (prefixes are created automatically) -There are two methods for setting GCS credentials: -1. Through the Project => Cloud Storage settings in the Label Studio user interface. -2. Through Google Application Default Credentials (ADC). This involves the following steps: +### 4. **Data Serialization** +Export storages use `_get_serialized_data()` which returns different formats based on feature flags: +- **Default**: Only annotation data (backward compatibility) +- **With `fflag_feat_optic_650_target_storage_task_format_long` or `FUTURE_SAVE_TASK_TO_STORAGE`**: Full task + annotations data instead of annotation per file output. - 2.1. 
Leave the Google Application Credentials field in the Label Studio UI blank. - - 2.2. Set an environment variable which will apply to all Cloud Storages. This can be done using the following command: - ```bash - export GOOGLE_APPLICATION_CREDENTIALS=google_credentials.json - ``` - 2.3. Alternatively, use the following command: - ```bash - gcloud auth application-default login - ``` - 2.4. Another option is to use credentials provided by the Google App Engine or Google Compute Engine metadata server, if the code is running on either GAE or GCE. +### 5. **Built-in Threading** +- Export storages inherit `save_annotations()` with built-in parallel processing +- Uses ThreadPoolExecutor with configurable `max_workers` (default: min(8, cpu_count * 4)) +- Includes progress tracking and automatic batch processing -Note: If Cloud Storage credentials are set in the Label Studio UI, these will take precedence over other methods. +### 6. **Storage Links & Key Generation** +- **Import links**: Track task imports with custom keys +- **Export links**: Track annotation exports with keys based on feature flags: + - Default: `annotation.id` + - With feature flag: `task.id` + optional `.json` extension - +### 7. **Optional Deletion Support** +- Export storages can implement `delete_annotation()` +- Controlled by `can_delete_objects` field +- Automatically called when annotations are deleted from Label Studio -## Storage statuses and how they are processed +## StorageInfo statuses and how they are processed Storage (Import and Export) have different statuses of synchronization (see `class StorageInfo.Status`): @@ -94,7 +150,7 @@ Storage (Import and Export) have different statuses of synchronization (see `cla InProgress-->Completed; ``` -Additionally, StorageInfo contains counters and debug information that will be displayed in storages: +Additionally, class **StorageInfo** contains counters and debug information that will be displayed in storages: * last_sync - time of the last successful sync * last_sync_count - number of objects that were successfully synced diff --git a/label_studio/io_storages/api.py b/label_studio/io_storages/api.py index 09b780a9bda3..c45b34171edc 100644 --- a/label_studio/io_storages/api.py +++ b/label_studio/io_storages/api.py @@ -174,7 +174,7 @@ def create(self, request, *args, **kwargs): from .functions import validate_storage_instance instance = validate_storage_instance(request, self.serializer_class) - limit = request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT) + limit = int(request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT)) try: files = [] diff --git a/label_studio/io_storages/proxy_api.py b/label_studio/io_storages/proxy_api.py index 0e7b12817204..432ab77dc275 100644 --- a/label_studio/io_storages/proxy_api.py +++ b/label_studio/io_storages/proxy_api.py @@ -199,7 +199,12 @@ def prepare_headers(self, response, metadata, request, project): if metadata.get('ContentRange'): response.headers['Content-Range'] = metadata['ContentRange'] if metadata.get('LastModified'): - response.headers['Last-Modified'] = metadata['LastModified'].strftime('%a, %d %b %Y %H:%M:%S GMT') + last_mod = metadata['LastModified'] + # Accept either datetime-like (has strftime) or preformatted string + if hasattr(last_mod, 'strftime'): + response.headers['Last-Modified'] = last_mod.strftime('%a, %d %b %Y %H:%M:%S GMT') + else: + response.headers['Last-Modified'] = str(last_mod) # Always enable range requests response.headers['Accept-Ranges'] = 'bytes' diff --git a/poetry.lock 
b/poetry.lock index b0abcbde3caa..5194a477f4c8 100644 --- a/poetry.lock +++ b/poetry.lock @@ -2136,7 +2136,7 @@ optional = false python-versions = ">=3.9,<4" groups = ["main"] files = [ - {file = "228d91101d156d61c2d6334cc6cc0bfe1a79c254.zip", hash = "sha256:ca78227c3b77bc873d9b8e86956e5cf455410f6db63a172e48df4508f7d6309f"}, + {file = "77b0c0abd2847c914096e6054b6f1b1805ee1b7a.zip", hash = "sha256:a29f9b9db793edfb97cbf384a2c9aac780656ca37893c998e412577f4e9277ad"}, ] [package.dependencies] @@ -2164,7 +2164,7 @@ xmljson = "0.2.1" [package.source] type = "url" -url = "https://github.com/HumanSignal/label-studio-sdk/archive/228d91101d156d61c2d6334cc6cc0bfe1a79c254.zip" +url = "https://github.com/HumanSignal/label-studio-sdk/archive/77b0c0abd2847c914096e6054b6f1b1805ee1b7a.zip" [[package]] name = "launchdarkly-server-sdk" @@ -5109,4 +5109,4 @@ uwsgi = ["pyuwsgi", "uwsgitop"] [metadata] lock-version = "2.1" python-versions = ">=3.10,<4" -content-hash = "01a88d20f4dd96c7a03d84466d8296c42f3979b692abe85635380fda345d73a1" +content-hash = "27d1096f39b9864ce20797fcaabcdf470de9b87aedf0b2bfe0b7b109096320a0" diff --git a/pyproject.toml b/pyproject.toml index 2dff30437f7d..cb1a83035a58 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -74,7 +74,7 @@ dependencies = [ "tldextract (>=5.1.3)", "uuid-utils (>=0.11.0,<1.0.0)", ## HumanSignal repo dependencies :start - "label-studio-sdk @ https://github.com/HumanSignal/label-studio-sdk/archive/228d91101d156d61c2d6334cc6cc0bfe1a79c254.zip", + "label-studio-sdk @ https://github.com/HumanSignal/label-studio-sdk/archive/77b0c0abd2847c914096e6054b6f1b1805ee1b7a.zip", ## HumanSignal repo dependencies :end ] diff --git a/web/libs/app-common/src/blocks/StorageProviderForm/hooks/useStorageApi.ts b/web/libs/app-common/src/blocks/StorageProviderForm/hooks/useStorageApi.ts index e6c9418e5356..08c4ef7bdc31 100644 --- a/web/libs/app-common/src/blocks/StorageProviderForm/hooks/useStorageApi.ts +++ b/web/libs/app-common/src/blocks/StorageProviderForm/hooks/useStorageApi.ts @@ -187,11 +187,11 @@ export const useStorageApi = ({ target, storage, project, onSubmit, onClose }: U if (isDefined(storage?.id)) { body.id = storage.id; + body.limit = 30; } return api.callApi<{ files: any[] }>("storageFiles", { params: { - limit: 10, target, type: previewData.provider, }, diff --git a/web/libs/ui/src/assets/icons/cloud-provider-databricks.svg b/web/libs/ui/src/assets/icons/cloud-provider-databricks.svg new file mode 100644 index 000000000000..41633407f177 --- /dev/null +++ b/web/libs/ui/src/assets/icons/cloud-provider-databricks.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/web/libs/ui/src/assets/icons/index.ts b/web/libs/ui/src/assets/icons/index.ts index 135e3b0200c8..2d22d9de089f 100644 --- a/web/libs/ui/src/assets/icons/index.ts +++ b/web/libs/ui/src/assets/icons/index.ts @@ -261,3 +261,4 @@ export { ReactComponent as IconCloudProviderS3 } from "./cloud-provider-s3.svg"; export { ReactComponent as IconCloudProviderRedis } from "./cloud-provider-redis.svg"; export { ReactComponent as IconCloudProviderGCS } from "./cloud-provider-gcs.svg"; export { ReactComponent as IconCloudProviderAzure } from "./cloud-provider-azure.svg"; +export { ReactComponent as IconCloudProviderDatabricks } from "./cloud-provider-databricks.svg";