Merged
Changes from all commits

35 commits
f2905dc
feat: BROS-353: Databricks storage integration
makseq Aug 22, 2025
f43c2ca
Sync Follow Merge dependencies
robot-ci-heartex Aug 23, 2025
1c8659c
Merge branch 'develop' into 'fb-bros-353'
robot-ci-heartex Aug 23, 2025
6f90270
Updates in rules
makseq Sep 8, 2025
918f5c7
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Sep 8, 2025
dd60f9e
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Sep 9, 2025
1ea214c
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Sep 12, 2025
2130b26
Merge branch 'fb-bros-353' of github.com:heartexlabs/label-studio int…
makseq Sep 12, 2025
603c750
Merge branch 'develop' into 'fb-bros-353'
makseq Sep 12, 2025
ff564c9
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
6e8e9b4
Merge branch 'develop' into 'fb-bros-353'
robot-ci-heartex Sep 13, 2025
653827c
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
d2d26bc
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
f5689e4
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
eea9de7
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
bd797b7
Merge branch 'develop' into 'fb-bros-353'
robot-ci-heartex Sep 13, 2025
6505c26
Sync Follow Merge dependencies
robot-ci-heartex Sep 13, 2025
f1f96d4
Delete label_studio/io_storages/gcs/README.md
makseq Sep 13, 2025
84c58d2
Delete label_stream.md
makseq Sep 13, 2025
85edbfa
Delete review_stream.md
makseq Sep 13, 2025
2f8bd96
Add icon. Add datetime fix
makseq Sep 13, 2025
a2bd260
Fix name for icon
makseq Sep 13, 2025
e7e0315
Add docs
makseq Sep 13, 2025
02893d8
Fix docs
makseq Sep 14, 2025
0ad74ce
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
7cc936b
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
07801b7
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
ea23358
Add path in docs
makseq Sep 14, 2025
4587772
Fix limits for file previews
makseq Sep 14, 2025
dc72043
Fix limit again
makseq Sep 14, 2025
8ca7d22
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
ab1a5e0
Merge branch 'develop' into 'fb-bros-353'
robot-ci-heartex Sep 14, 2025
835a379
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
3858a8c
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
3465f49
Sync Follow Merge dependencies
robot-ci-heartex Sep 14, 2025
255 changes: 124 additions & 131 deletions .cursor/rules/storage-provider.mdc

Large diffs are not rendered by default.

84 changes: 83 additions & 1 deletion docs/source/guide/storage.md
Expand Up @@ -19,6 +19,8 @@ Set up the following cloud and other storage systems with Label Studio:
- [Microsoft Azure Blob storage](#Microsoft-Azure-Blob-storage)
- [Redis database](#Redis-database)
- [Local storage](#Local-storage) <div class="enterprise-only">(for On-prem only)</div>
- [Databricks Files (UC Volumes)](#Databricks-Files-UC-Volumes)


## Troubleshooting

Expand All @@ -43,6 +45,7 @@ For more troubleshooting information, see [Troubleshooting Import, Export, & Sto

</div>


## How external storage connections and sync work

You can add source storage connections to sync data from an external source to a Label Studio project, and add target storage connections to sync annotations from Label Studio to external storage. Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project.
Expand Down Expand Up @@ -1483,7 +1486,86 @@ You can also create a storage connection using the Label Studio API.
If you're using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See [Run Label Studio on Docker and use local storage](https://labelstud.io/guide/start#Run-Label-Studio-on-Docker-and-use-Local-Storage).


### Troubleshooting cloud storage


## Databricks Files (UC Volumes)

<div class="enterprise-only">

Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode, because Databricks does not support presigned URLs.

### Prerequisites
- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or the equivalent Azure Databricks domain)
- A Databricks Personal Access Token (PAT) with permission to access the Files API
- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label

References:
- Databricks workspace: https://docs.databricks.com/en/getting-started/index.html
- Personal access tokens: https://docs.databricks.com/en/dev-tools/auth/pat.html
- Unity Catalog and Volumes: https://docs.databricks.com/en/files/volumes.html

### Set up connection in the Label Studio UI
1. Open Label Studio → project → **Settings > Cloud Storage**.
2. Click **Add Source Storage**. Select **Databricks Files (UC Volumes)**.
3. Configure the connection:
- Workspace Host: your Databricks workspace base URL (no trailing slash)
- Access Token: your PAT
- Catalog / Schema / Volume: Unity Catalog coordinates
- Click **Next** to open Import Settings & Preview
4. Import Settings & Preview:
- Bucket Prefix (optional): relative subpath under the volume (e.g., `images/train`)
- File Name Filter (optional): regex to filter files (e.g., `.*\.json$`)
- Scan all sub-folders: enable for recursive listing; disable to list only current folder
- Click **Load preview** to verify files
5. Click **Save** (or **Save & Sync**) to create the connection and sync tasks.
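
You can also create the connection programmatically. The sketch below uses the REST API and assumes the Databricks connector follows the same `/api/storages/<provider>` pattern as other providers; the endpoint path and field names are inferred from the UI fields above, so verify them against your instance's API reference.

```python
import requests

LABEL_STUDIO_URL = "https://app.humansignal.com"  # your Label Studio Enterprise URL
API_TOKEN = "<your-label-studio-api-token>"

# Hypothetical endpoint and field names -- verify against the API reference
response = requests.post(
    f"{LABEL_STUDIO_URL}/api/storages/databricks",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={
        "project": 1,                       # target project ID
        "host": "https://adb-12345678901234.1.databricks.com",
        "token": "<databricks-pat>",
        "catalog": "main",
        "schema": "default",
        "volume": "dataset",
        "prefix": "images/train",           # optional subpath under the volume
        "regex_filter": r".*\.(jpg|png)$",  # optional file name filter
        "recursive_scan": True,
    },
)
response.raise_for_status()
print(response.json())
```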

### Target storage (export)
1. Open **Settings > Cloud Storage** → **Add Target Storage** → **Databricks Files (UC Volumes)**.
2. Use the same Workspace Host/Token and UC coordinates.
3. Set an Export Prefix (e.g., `exports/${project_id}`).
4. Click **Save** and then **Sync** to push annotations as JSON files to your volume.

!!! note "URI schema"
To reference Databricks files directly in task JSON (without using an Import Storage), use Label Studio’s Databricks URI scheme:

`dbx://Volumes/<catalog>/<schema>/<volume>/<path>`

Example:

```json
{ "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }
```
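
A sketch of importing such tasks through the standard task import endpoint (adjust the host, project ID, and token for your instance):

```python
import requests

LABEL_STUDIO_URL = "http://localhost:8080"
API_TOKEN = "<your-label-studio-api-token>"
PROJECT_ID = 1

# Each task references a file in a UC Volume via the dbx:// scheme
tasks = [
    {"image": "dbx://Volumes/main/default/dataset/images/1.jpg"},
    {"image": "dbx://Volumes/main/default/dataset/images/2.jpg"},
]

response = requests.post(
    f"{LABEL_STUDIO_URL}/api/projects/{PROJECT_ID}/import",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json=tasks,
)
response.raise_for_status()
```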


!!! note "Troubleshooting"
- If listing returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>` (plus the optional prefix) and your PAT permissions.
- Ensure the Workspace Host has no trailing slash and matches your workspace domain.
- If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio and network egress allows Label Studio to reach Databricks.


!!! warning "Proxy and security"
This connector streams data **through the Label Studio backend** with HTTP Range support. Because Databricks does not support presigned URLs, presigned-URL delivery is not available for this connector in Label Studio.
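
To illustrate the Range support, the sketch below requests the first kilobyte of a file through the proxy. The resolver URL pattern shown here is an assumption; copy the real URL from a resolved task payload in your instance.

```python
import requests

# Hypothetical proxy URL -- take the actual value from a resolved task payload
proxy_url = "http://localhost:8080/tasks/123/resolve/?fileuri=<base64-encoded-dbx-uri>"

response = requests.get(
    proxy_url,
    headers={
        "Authorization": "Token <your-label-studio-api-token>",
        "Range": "bytes=0-1023",  # first 1 KiB only
    },
)
print(response.status_code)                   # expect 206 Partial Content
print(response.headers.get("Content-Range"))  # e.g. bytes 0-1023/457832
```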


</div>

<div class="opensource-only">

### Use Databricks Files in Label Studio Enterprise

Databricks Unity Catalog (UC) Volumes integration is available in Label Studio Enterprise. It lets you:

- Import files directly from UC Volumes under `/Volumes/<catalog>/<schema>/<volume>`
- Stream media securely via the platform proxy (no presigned URLs)
- Export annotations back to your Databricks Volume as JSON

Learn more and see the full setup guide in the Enterprise documentation: [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes). If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).

</div>



## Troubleshooting cloud storage

<div class="opensource-only">

Expand Down
110 changes: 83 additions & 27 deletions label_studio/io_storages/README.md
@@ -1,21 +1,21 @@
# Cloud Storages

There are 3 basic types of cloud storages:
Cloud storages are used for importing tasks and exporting annotations in Label Studio. There are two basic types of cloud storage:

1. Import Storages (aka Source Cloud Storages)
2. Export Storages (aka Target Cloud Storages)
3. Dataset Storages (available in enterprise)

Also, Label Studio has Persistent Storage, where it stores export files, user avatars, and UI uploads. Do not confuse Cloud Storages with Persistent Storage; they have completely different codebases and purposes. Cloud Storages are implemented in `io_storages`, while Persistent Storage uses django-storages and is configured through Django settings environment variables (see `base.py`).

Note: Dataset Storages were implemented in the enterprise codebase only. They are **deprecated and not used**.

## Basic hierarchy

This section uses GCS storage as an example, and the same logic can be applied to other storages.

## Basic hierarchy

### Import and Dataset Storages
### Import Storages

This diagram is based on Google Cloud Storage (GCS) and other storages are implemented the same way.
This storage type is designed for importing tasks FROM cloud storage to Label Studio. This diagram is based on Google Cloud Storage (GCS), and other storages are implemented in the same way:

```mermaid
graph TD;
Expand All @@ -28,7 +28,7 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
GCSImportStorageBase-->GCSImportStorage;
GCSImportStorageBase-->GCSDatasetStorage;

DatasetStorageMixin-->GCSDatasetStorage;
GCSImportStorageLink-->ImportStorageLink

subgraph Google Cloud Storage
GCSImportStorage;
Expand All @@ -37,7 +37,52 @@ This diagram is based on Google Cloud Storage (GCS) and other storages are imple
end
```

- **Storage** (`label_studio/io_storages/base_models.py`): Abstract base for all storages. Inherits status/progress from `StorageInfo`. Defines `validate_connection()` contract and common metadata fields.

- **ImportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for source storages. Defines core contracts used by sync and proxy:
- `iter_objects()`, `iter_keys()` to enumerate objects
- `get_unified_metadata(obj)` to normalize provider metadata
- `get_data(key)` to produce `StorageObject`(s) for task creation
- `generate_http_url(url)` to resolve provider URL -> HTTP URL (presigned or direct)
- `resolve_uri(...)` and `can_resolve_url(...)` used by the Storage Proxy
- `scan_and_create_links()` to create `ImportStorageLink`s for tasks

- **ImportStorageLink** (`label_studio/io_storages/base_models.py`): Link model created per-task for imported objects. Fields: `task` (1:1), `key` (external key), `row_group`/`row_index` (parquet/JSONL indices), `object_exists`, timestamps. Helpers: `n_tasks_linked(key, storage)` and `create(task, key, storage, row_index=None, row_group=None)`.

- **ProjectStorageMixin** (`label_studio/io_storages/base_models.py`): Adds `project` FK and permission checks. Used by project-scoped storages (e.g., `GCSImportStorage`).

- **GCSImportStorageBase** (`label_studio/io_storages/gcs/models.py`): GCS-specific import base. Sets `url_scheme='gs'`, implements listing (`iter_objects/iter_keys`), data loading (`get_data`), URL generation (`generate_http_url`), URL resolution checks, and metadata helpers. Reused by both project imports and enterprise datasets.

- **GCSImportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete project-scoped GCS import storage combining `ProjectStorageMixin` + `GCSImportStorageBase`.

- **GCSImportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific `ImportStorageLink` with `storage` FK to `GCSImportStorage`. Created during sync to associate a task with the original GCS object key.
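
Putting the import contract together: a new provider subclasses these bases and implements a handful of methods. The following is a hedged sketch — the client helpers (`get_client`, `list_objects`, `download`, `presigned_url`) and the `StorageObject` import path are illustrative, not the actual GCS code:

```python
import json

from io_storages.base_models import ImportStorage
from io_storages.utils import StorageObject  # import path is an assumption


class MyProviderImportStorageBase(ImportStorage):
    url_scheme = 'myp'  # tasks reference files as myp://bucket/key

    def iter_keys(self):
        # Enumerate object keys under the configured prefix
        for obj in self.get_client().list_objects(self.bucket, prefix=self.prefix):
            yield obj.key

    def get_data(self, key):
        # Produce StorageObject(s) that scan_and_create_links() turns into tasks
        if self.use_blob_urls:
            # Reference the file by URI; media is resolved later via the proxy
            task_data = {'image': f'{self.url_scheme}://{self.bucket}/{key}'}
            return [StorageObject(key=key, task_data=task_data)]
        # Otherwise the object itself is a task definition (JSON)
        raw = self.get_client().download(self.bucket, key)
        return [StorageObject(key=key, task_data=json.loads(raw))]

    def generate_http_url(self, url):
        # Resolve myp://... to an HTTP URL (direct or presigned)
        return self.get_client().presigned_url(url)
```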

### Export Storages

This storage type is designed for exporting tasks or annotations FROM Label Studio to cloud storage.

```mermaid
graph TD;

Storage-->ExportStorage;

ProjectStorageMixin-->ExportStorage;
ExportStorage-->GCSExportStorage;
GCSStorageMixin-->GCSExportStorage;

ExportStorageLink-->GCSExportStorageLink;
```

- **ExportStorage** (`label_studio/io_storages/base_models.py`): Abstract base for target storages. Project-scoped; orchestrates export jobs and progress. Key methods:
- `save_annotation(annotation)` provider-specific write
- `save_annotations(queryset)`, `save_all_annotations()`, `save_only_new_annotations()` helpers
- `sync(save_only_new_annotations=False)` background export via RQ

- **GCSExportStorage** (`label_studio/io_storages/gcs/models.py`): Concrete target storage for GCS. Serializes data via `_get_serialized_data(...)`, computes key via `GCSExportStorageLink.get_key(...)`, uploads to GCS; can auto-export on annotation save when configured.

- **ExportStorageLink** (`label_studio/io_storages/base_models.py`): Base link model connecting exported objects to `Annotation`s. Provides `get_key(annotation)` logic (task-based or annotation-based via FF) and `create(...)` helper.

- **GCSExportStorageLink** (`label_studio/io_storages/gcs/models.py`): Provider-specific link model holding FK to `GCSExportStorage`.
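
A provider export storage then mainly needs `save_annotation()`. A hedged sketch (the `MyProvider*` names and client helper are illustrative; compare `GCSExportStorage` for the real logic):

```python
import json


class MyProviderExportStorage(ExportStorage):
    def save_annotation(self, annotation):
        ser_data = self._get_serialized_data(annotation)       # format depends on feature flags
        key = MyProviderExportStorageLink.get_key(annotation)  # task- or annotation-based key
        if self.prefix:
            key = f'{self.prefix}/{key}'
        self.get_client().upload(self.bucket, key, json.dumps(ser_data))
        # Record the link so re-exports and deletions can find the object
        MyProviderExportStorageLink.create(annotation, self)
```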

## How validate_connection() works

Expand All @@ -50,32 +95,43 @@ Run this command with try/except:
Target storages use the same validate_connection() function, but without any prefix.
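
A hedged sketch of that contract, assuming a GCS-like client (the real checks live in each provider's `models.py`):

```python
def validate_connection(self, client=None):
    client = client or self.get_client()
    if self.prefix:
        # Import storages: the prefix must contain at least one object
        blobs = client.list_blobs(self.bucket, prefix=self.prefix, max_results=1)
        if not list(blobs):
            raise ValueError(f'No objects found under prefix "{self.prefix}"')
    else:
        # Target storages: only verify that the bucket is reachable
        client.get_bucket(self.bucket)
```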


## Google Cloud Storage (GCS)
## Key Storage Insights

### 1. **Primary Methods**
- **Import storages**: `iter_objects()`, `get_data()`
- **Export storages**: `save_annotation()`, `save_annotations()`

### 2. **Automatic vs Manual Operation**
- **Import storages**: Require manual sync via API calls or UI
- **Export storages**: Manual sync via API/UI, plus automatic export via Django signals when annotations are submitted or updated
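
The automatic path is wired with Django signals; a minimal sketch of the idea (the receiver and the `get_all_export_storages` helper are assumptions, not the exact source):

```python
from django.db.models.signals import post_save
from django.dispatch import receiver
from tasks.models import Annotation


@receiver(post_save, sender=Annotation)
def auto_export_annotation(sender, instance, **kwargs):
    # Push the new or updated annotation to every connected target storage
    for storage in instance.project.get_all_export_storages():  # helper name is an assumption
        storage.save_annotation(instance)
```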

### Credentials
### 3. **Connection Validation Differences**
- **Import storages**: Must validate that prefix contains files during `validate_connection()`
- **Export storages**: Only validate bucket access, NOT prefix (prefixes are created automatically)

There are two methods for setting GCS credentials:
1. Through the Project => Cloud Storage settings in the Label Studio user interface.
2. Through Google Application Default Credentials (ADC). This involves the following steps:
### 4. **Data Serialization**
Export storages use `_get_serialized_data()` which returns different formats based on feature flags:
- **Default**: Only annotation data (backward compatibility)
- **With `fflag_feat_optic_650_target_storage_task_format_long` or `FUTURE_SAVE_TASK_TO_STORAGE`**: Full task + annotations data instead of one annotation per output file.
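
In pseudocode, the switch looks roughly like this (the serializer names are assumptions, and the feature-flag call is simplified):

```python
from django.conf import settings
from core.feature_flags import flag_set  # Label Studio's feature-flag helper


def _get_serialized_data(self, annotation):
    if flag_set('fflag_feat_optic_650_target_storage_task_format_long') \
            or settings.FUTURE_SAVE_TASK_TO_STORAGE:
        # Full task document, including all of its annotations
        return TaskWithAnnotationsSerializer(annotation.task).data  # name is an assumption
    # Default: just this annotation, for backward compatibility
    return AnnotationSerializer(annotation).data  # name is an assumption
```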

2.1. Leave the Google Application Credentials field in the Label Studio UI blank.

2.2. Set an environment variable which will apply to all Cloud Storages. This can be done using the following command:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=google_credentials.json
```
2.3. Alternatively, use the following command:
```bash
gcloud auth application-default login
```
2.4. Another option is to use credentials provided by the Google App Engine or Google Compute Engine metadata server, if the code is running on either GAE or GCE.
### 5. **Built-in Threading**
- Export storages inherit `save_annotations()` with built-in parallel processing
- Uses ThreadPoolExecutor with configurable `max_workers` (default: min(8, cpu_count * 4))
- Includes progress tracking and automatic batch processing
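
A hedged sketch of that loop (the progress attribute name is an assumption):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def save_annotations(self, queryset):
    annotations = list(queryset)
    if not annotations:
        return
    max_workers = min(8, (os.cpu_count() or 1) * 4)  # default described above
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in order as uploads complete
        for done, _ in enumerate(pool.map(self.save_annotation, annotations), start=1):
            self.cached_progress = done / len(annotations)  # progress field name is an assumption
```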

Note: If Cloud Storage credentials are set in the Label Studio UI, these will take precedence over other methods.
### 6. **Storage Links & Key Generation**
- **Import links**: Track task imports with custom keys
- **Export links**: Track annotation exports with keys based on feature flags:
- Default: `annotation.id`
- With feature flag: `task.id` + optional `.json` extension
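
Roughly (the feature-flag check mirrors the serialization switch above, with `flag_set` imported as in that sketch):

```python
@staticmethod
def get_key(annotation):
    if flag_set('fflag_feat_optic_650_target_storage_task_format_long'):
        # Task-based key, optionally with a .json extension
        return f'{annotation.task.id}.json'
    # Default: annotation-based key
    return str(annotation.id)
```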


### 7. **Optional Deletion Support**
- Export storages can implement `delete_annotation()`
- Controlled by `can_delete_objects` field
- Automatically called when annotations are deleted from Label Studio


## Storage statuses and how they are processed
## StorageInfo statuses and how they are processed

Storages (Import and Export) have different synchronization statuses (see `class StorageInfo.Status`):

Expand All @@ -94,7 +150,7 @@ Storage (Import and Export) have different statuses of synchronization (see `cla
InProgress-->Completed;
```

Additionally, StorageInfo contains counters and debug information that will be displayed in storages:
Additionally, the **StorageInfo** class contains counters and debug information that are displayed for each storage:

* last_sync - time of the last successful sync
* last_sync_count - number of objects that were successfully synced
Expand Down
2 changes: 1 addition & 1 deletion label_studio/io_storages/api.py
Expand Up @@ -174,7 +174,7 @@ def create(self, request, *args, **kwargs):
from .functions import validate_storage_instance

instance = validate_storage_instance(request, self.serializer_class)
limit = request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT)
limit = int(request.data.get('limit', settings.DEFAULT_STORAGE_LIST_LIMIT))

try:
files = []
Expand Down
7 changes: 6 additions & 1 deletion label_studio/io_storages/proxy_api.py
Expand Up @@ -199,7 +199,12 @@ def prepare_headers(self, response, metadata, request, project):
if metadata.get('ContentRange'):
response.headers['Content-Range'] = metadata['ContentRange']
if metadata.get('LastModified'):
response.headers['Last-Modified'] = metadata['LastModified'].strftime('%a, %d %b %Y %H:%M:%S GMT')
last_mod = metadata['LastModified']
# Accept either datetime-like (has strftime) or preformatted string
if hasattr(last_mod, 'strftime'):
response.headers['Last-Modified'] = last_mod.strftime('%a, %d %b %Y %H:%M:%S GMT')
else:
response.headers['Last-Modified'] = str(last_mod)

# Always enable range requests
response.headers['Accept-Ranges'] = 'bytes'
Expand Down
6 changes: 3 additions & 3 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
Expand Up @@ -74,7 +74,7 @@ dependencies = [
"tldextract (>=5.1.3)",
"uuid-utils (>=0.11.0,<1.0.0)",
## HumanSignal repo dependencies :start
"label-studio-sdk @ https://github.com/HumanSignal/label-studio-sdk/archive/228d91101d156d61c2d6334cc6cc0bfe1a79c254.zip",
"label-studio-sdk @ https://github.com/HumanSignal/label-studio-sdk/archive/77b0c0abd2847c914096e6054b6f1b1805ee1b7a.zip",
## HumanSignal repo dependencies :end
]

Expand Down
Expand Up @@ -187,11 +187,11 @@ export const useStorageApi = ({ target, storage, project, onSubmit, onClose }: U

if (isDefined(storage?.id)) {
body.id = storage.id;
body.limit = 30;
}

return api.callApi<{ files: any[] }>("storageFiles", {
params: {
limit: 10,
target,
type: previewData.provider,
},
Expand Down
1 change: 1 addition & 0 deletions web/libs/ui/src/assets/icons/cloud-provider-databricks.svg
1 change: 1 addition & 0 deletions web/libs/ui/src/assets/icons/index.ts
Expand Up @@ -261,3 +261,4 @@ export { ReactComponent as IconCloudProviderS3 } from "./cloud-provider-s3.svg";
export { ReactComponent as IconCloudProviderRedis } from "./cloud-provider-redis.svg";
export { ReactComponent as IconCloudProviderGCS } from "./cloud-provider-gcs.svg";
export { ReactComponent as IconCloudProviderAzure } from "./cloud-provider-azure.svg";
export { ReactComponent as IconCloudProviderDatabricks } from "./cloud-provider-databricks.svg";