HLS STAC Parquet

Query NASA's CMR for HLS (Harmonized Landsat Sentinel-2) satellite data and cache STAC items as GeoParquet files. Supports both local processing and AWS Lambda + Step Functions deployment.

The AWS Step Functions + Lambda pipeline writes a hive-partitioned parquet dataset following this pattern: s3://{bucket}/{prefix}/{version}/{collection}/year={year}/month={month}/*.parquet.
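As a rough illustration of the partition pattern above, the following sketch builds the directory prefix for one month. The function name `partition_prefix` is hypothetical and not part of this package, and zero-padding the month to two digits is an assumption.

```python
def partition_prefix(bucket: str, prefix: str, version: str,
                     collection: str, year: int, month: int) -> str:
    """Return the hive-partitioned directory for one month of one collection.

    Hypothetical helper: it only mirrors the documented pattern
    s3://{bucket}/{prefix}/{version}/{collection}/year={year}/month={month}/.
    Zero-padding the month is an assumption.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return (
        f"s3://{bucket}/{prefix}/{version}/{collection}/"
        f"year={year}/month={month:02d}/"
    )


print(partition_prefix("bucket", "data", "v2", "HLSL30.v2.0", 2024, 1))
```

Because the partition keys are encoded in the path, readers that understand hive partitioning can usually prune by year and month without opening the files.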

Development

git clone https://github.com/MAAP-project/hls-stac-parquet.git
cd hls-stac-parquet

uv sync

CLI Usage

The CLI splits processing into two steps:

1. Cache Daily STAC Links

Query CMR and cache STAC JSON links for a specific day and collection:

uv run hls-stac-parquet cache-daily-stac-json-links HLSL30 2024-01-15 s3://bucket/data

# Optional: filter by bounding box (west, south, east, north)
uv run hls-stac-parquet cache-daily-stac-json-links HLSS30 2024-01-15 s3://bucket/data \
  --bounding-box -100,40,-90,50
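The bounding box is passed as a comma-separated `west,south,east,north` string. How the CLI validates it is not shown here, so the sketch below is only one plausible way to parse and sanity-check such a value. `BoundingBox` and `parse_bounding_box` are hypothetical names.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def parse_bounding_box(text: str) -> BoundingBox:
    """Parse 'west,south,east,north' into floats (hypothetical helper)."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    west, south, east, north = (float(p) for p in parts)
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    # west > east is deliberately allowed: it can describe a box that
    # crosses the antimeridian.
    return BoundingBox(west, south, east, north)


print(parse_bounding_box("-100,40,-90,50"))
```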

2. Write Monthly GeoParquet

Read cached links and write monthly GeoParquet files:

uv run hls-stac-parquet write-monthly-stac-geoparquet HLSL30 2024-01 s3://bucket/data

# Optional: set the output version and allow incomplete daily links
uv run hls-stac-parquet write-monthly-stac-geoparquet HLSS30 2024-01 s3://bucket/data \
  --version v0.1.0 \
  --no-require-complete-links

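Collections

HLSL30: HLS Landsat product (available from 2013-04-01)
HLSS30: HLS Sentinel-2 product (available from 2015-11-01)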

Output Structure

s3://bucket/data/
├── links/
│   ├── HLSL30.v2.0/2024/01/2024-01-01.json
│   ├── HLSL30.v2.0/2024/01/2024-01-02.json
│   └── ...
└── v2/
    └── HLSL30.v2.0/year=2024/month=01/HLSL30_2.0-2024-1.parquet
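The `links/` layout shown above has one JSON file per day, which makes it easy to work out which daily caches are still missing for a month. The sketch below infers the key format from the tree; the helper names are hypothetical and the real package may build keys differently.

```python
import calendar
from datetime import date
from typing import Iterable, List


def daily_link_key(collection_id: str, day: date) -> str:
    """Key of one day's cached links file, relative to the destination root."""
    return f"links/{collection_id}/{day:%Y}/{day:%m}/{day.isoformat()}.json"


def expected_link_keys(collection_id: str, year: int, month: int) -> List[str]:
    """All daily link keys a complete month should contain."""
    n_days = calendar.monthrange(year, month)[1]
    return [
        daily_link_key(collection_id, date(year, month, d))
        for d in range(1, n_days + 1)
    ]


def missing_link_keys(collection_id: str, year: int, month: int,
                      existing: Iterable[str]) -> List[str]:
    """Expected keys that are absent from an existing listing."""
    present = set(existing)
    return [k for k in expected_link_keys(collection_id, year, month)
            if k not in present]


have = expected_link_keys("HLSL30.v2.0", 2024, 1)[:29]  # pretend 2 days are missing
print(missing_link_keys("HLSL30.v2.0", 2024, 1, have))
```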

AWS Deployment

The processing infrastructure is deployed with AWS CDK.

Architecture

Serverless Lambda + Step Functions:

Cache Daily Lambda: Lightweight CMR queries (1024 MB memory, 300s timeout, max 4 concurrent)
Write Monthly Lambda: Write monthly GeoParquet files (8192 MB memory, 15min timeout, no concurrency limit)
Month Calculator Lambda: Generate dates array for Step Functions (128 MB memory, 30s timeout)
Month List Generator Lambda: Generate month list for backfill workflow (128 MB memory, 30s timeout)
Monthly Workflow State Machine: Orchestrates single month processing (cache-daily → write-monthly)
Backfill Workflow State Machine: Orchestrates multi-month historical backfill (max 3 months in parallel)
EventBridge Rules: Automated trigger every 5 days for both previous-month catch-up and current-month incremental updates (enabled by default)
CloudWatch Alarms: Monitor Lambda errors and Step Functions timeouts, publishing to the SNS alert topic
Storage: S3 bucket for cached STAC links
Logging: CloudWatch logs for all Lambda functions and Step Functions executions

Deployment

cd infrastructure
npm install && npm run build
npm run deploy

Running Jobs

Automated Monthly Workflow (Step Functions)

The Step Functions state machine runs automatically every 5 days to keep the archive current. Each trigger fires four executions (staggered by one hour):

10:00 UTC — HLSL30 previous month (straggler catch-up)
11:00 UTC — HLSS30 previous month (straggler catch-up)
12:00 UTC — HLSL30 current month (incremental build)
13:00 UTC — HLSS30 current month (incremental build)

Each execution:

Calculates the target month's date range
Caches STAC links for all days in that month (parallel, max 4 concurrent)
Writes the monthly GeoParquet file
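The three steps of an execution can be mimicked locally with a small driver. This is only a sketch of the control flow, not the deployed state machine: `process_month`, `cache_day`, and `write_month` are hypothetical names, and a thread pool stands in for the Step Functions map state.

```python
import calendar
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from typing import Any, Callable


def process_month(collection: str, year: int, month: int,
                  cache_day: Callable[[str, date], Any],
                  write_month: Callable[[str, int, int], Any],
                  max_concurrency: int = 4) -> Any:
    """Cache every day of a month (bounded parallelism), then write the month."""
    # Step 1: calculate the target month's date range.
    n_days = calendar.monthrange(year, month)[1]
    days = [date(year, month, d) for d in range(1, n_days + 1)]

    # Step 2: cache links for all days, at most `max_concurrency` at a time.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # list() waits for every day and re-raises the first failure, if any.
        list(pool.map(lambda d: cache_day(collection, d), days))

    # Step 3: write the monthly file only after all days succeeded.
    return write_month(collection, year, month)


cached = []
result = process_month(
    "HLSL30", 2024, 2,
    cache_day=lambda c, d: cached.append(d),           # stand-in for a CMR query
    write_month=lambda c, y, m: f"{c}-{y}-{m:02d}",    # stand-in for the writer
)
print(result, len(cached))
```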

Input Parameters:

collection (required): Either "HLSL30" or "HLSS30"
yearmonth (optional): Month to process, given as "YYYY-MM-DD" (the day is ignored). If omitted, the previous month is processed

Note: dest and version are configured at deployment time and cannot be overridden at runtime.
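The `yearmonth` rule (day ignored, previous month when omitted) can be expressed in a few lines. The sketch below is an assumption about how such a default might be resolved; `resolve_target_month` is a hypothetical name, and the optional `today` argument exists only to make the behavior reproducible.

```python
from datetime import date
from typing import Optional, Tuple


def resolve_target_month(yearmonth: Optional[str] = None,
                         today: Optional[date] = None) -> Tuple[int, int]:
    """Return (year, month) for an optional 'YYYY-MM-DD' input."""
    if yearmonth is not None:
        parsed = date.fromisoformat(yearmonth)  # raises ValueError if malformed
        return parsed.year, parsed.month        # the day component is ignored
    today = today or date.today()
    if today.month == 1:
        return today.year - 1, 12               # January rolls back to December
    return today.year, today.month - 1


print(resolve_target_month("2024-11-17"))
print(resolve_target_month(None, today=date(2025, 1, 10)))
```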

Manual Invocation:

# Get the state machine ARN
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
  --output text)

# Start execution - process previous month (default behavior)
aws stepfunctions start-execution \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --name "manual-hlsl30-$(date +%Y%m%d-%H%M%S)" \
  --input '{"collection": "HLSL30"}'

# Start execution - process a specific month
aws stepfunctions start-execution \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --name "manual-hlsl30-2024-11-$(date +%Y%m%d-%H%M%S)" \
  --input '{"collection": "HLSL30", "yearmonth": "2024-11-01"}'

# Monitor execution
EXECUTION_ARN=$(aws stepfunctions list-executions \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --max-results 1 \
  --query 'executions[0].executionArn' \
  --output text)

aws stepfunctions describe-execution --execution-arn "$EXECUTION_ARN"
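The same invocation can be prepared from Python. The sketch below only builds the request arguments, so it runs without AWS credentials. `build_execution_request` is a hypothetical helper, the ARN is a placeholder, and unlike the shell example the timestamp in the execution name uses UTC.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_execution_request(state_machine_arn: str, collection: str,
                            yearmonth: Optional[str] = None,
                            now: Optional[datetime] = None) -> Dict[str, str]:
    """Build keyword arguments for a Step Functions StartExecution call."""
    now = now or datetime.now(timezone.utc)
    payload = {"collection": collection}
    if yearmonth is not None:
        payload["yearmonth"] = yearmonth
    return {
        "stateMachineArn": state_machine_arn,
        # Execution names must be unique per state machine, hence the timestamp.
        "name": f"manual-{collection.lower()}-{now:%Y%m%d-%H%M%S}",
        "input": json.dumps(payload),
    }


request = build_execution_request(
    "arn:aws:states:us-west-2:123456789012:stateMachine:example",  # placeholder
    "HLSL30",
    yearmonth="2024-11-01",
    now=datetime(2024, 12, 1, 10, 0, 0, tzinfo=timezone.utc),
)
print(request["name"])
```

With boto3 installed and credentials configured, the request could then be submitted with `boto3.client("stepfunctions").start_execution(**request)`.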

Historical Backfills (Backfill Workflow)

WARNING: The backfill workflow places a heavy load on NASA's CMR API. It processes multiple months in parallel, and each month issues ~30 CMR requests. Use it responsibly and consider rate limiting for large historical backfills.

The backfill workflow is a parent Step Functions state machine that orchestrates the complete historical rebuild:

Generate month list: Calculate all year-months to process for a date range
Process months in parallel: Invoke the monthly workflow for each month (max 3 concurrent)
Each monthly workflow: Cache all days (max 4 concurrent) → Write monthly GeoParquet
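The first step, generating the month list, amounts to enumerating every calendar month between two dates. The sketch below is one way to do that and is not taken from the deployed Lambda; `month_list` is a hypothetical name, and the `YYYY-MM` output format is an assumption.

```python
from datetime import date
from typing import List


def month_list(start: date, end: date) -> List[str]:
    """Return every 'YYYY-MM' from start's month through end's month, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:          # roll over into the next year
            year, month = year + 1, 1
    return months


print(month_list(date(2023, 11, 15), date(2024, 2, 1)))
```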

Advantages over manual batch processing:

Infrastructure-managed, no long-running scripts
Built-in concurrency control to protect upstream API
Automatic retries and error handling
Progress tracking in Step Functions console
Can process entire collection history in one command

Input Parameters:

collection (required): Either "HLSL30" or "HLSS30"
start_date (optional): ISO format date (YYYY-MM-DD). Defaults to collection origin date (HLSL30: 2013-04-01, HLSS30: 2015-11-01)
end_date (optional): ISO format date (YYYY-MM-DD). Defaults to last complete month

Note: dest and version are configured at deployment time and cannot be overridden at runtime.
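The defaults for `start_date` and `end_date` can be sketched as follows. This is an illustration rather than the deployed code: `resolve_backfill_range` is a hypothetical name, and "last complete month" is interpreted here as the calendar month before the current one.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Origin dates as documented in the parameter list above.
COLLECTION_ORIGINS = {
    "HLSL30": date(2013, 4, 1),
    "HLSS30": date(2015, 11, 1),
}


def resolve_backfill_range(collection: str,
                           start_date: Optional[str] = None,
                           end_date: Optional[str] = None,
                           today: Optional[date] = None) -> Tuple[date, date]:
    """Apply the documented defaults to an optional backfill date range."""
    if collection not in COLLECTION_ORIGINS:
        raise ValueError(f"unknown collection: {collection!r}")
    today = today or date.today()
    start = (date.fromisoformat(start_date) if start_date
             else COLLECTION_ORIGINS[collection])
    if end_date:
        end = date.fromisoformat(end_date)
    else:
        # Assumption: the last complete month ends the day before the
        # first day of the current month.
        end = today.replace(day=1) - timedelta(days=1)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return start, end


print(resolve_backfill_range("HLSL30", today=date(2025, 3, 15)))
```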

Running a Backfill:

# Get the backfill state machine ARN
BACKFILL_STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`BackfillStateMachineArn`].OutputValue' \
  --output text)

# Backfill entire HLSL30 history (2013-04 through the last complete month)
# WARNING: This will make ~50,000 CMR API requests over several hours
aws stepfunctions start-execution \
  --state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
  --name "backfill-hlsl30-full-$(date +%Y%m%d-%H%M%S)" \
  --input '{"collection": "HLSL30"}'

# Backfill specific date range (2020-2024)
aws stepfunctions start-execution \
  --state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
  --name "backfill-hlsl30-2020s-$(date +%Y%m%d-%H%M%S)" \
  --input '{
    "collection": "HLSL30",
    "start_date": "2020-01-01",
    "end_date": "2024-12-01"
  }'

# Monitor execution progress
EXECUTION_ARN=$(aws stepfunctions list-executions \
  --state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
  --max-results 1 \
  --query 'executions[0].executionArn' \
  --output text)

aws stepfunctions describe-execution --execution-arn "$EXECUTION_ARN"

# View backfill workflow logs
aws logs tail /aws/vendedlogs/states/hls-backfill-workflow --follow

Concurrency Settings:

Backfill workflow: Processes 3 months concurrently
Monthly workflow: Processes 4 days concurrently per month
Total concurrent CMR requests: ~12 (3 months × 4 days)
write-monthly Lambda: No concurrency limit (processes as many months as needed)
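The 3 × 4 bound can be checked with a small simulation that nests two limits, one for months and one for days within each month. This is a toy model using asyncio semaphores, not the Step Functions implementation, and all names in it are hypothetical.

```python
import asyncio


async def simulate_backfill(n_months: int, days_per_month: int,
                            month_limit: int = 3, day_limit: int = 4) -> int:
    """Run a fake backfill and return the peak number of concurrent 'requests'."""
    month_slots = asyncio.Semaphore(month_limit)
    active = 0
    peak = 0

    async def cache_day() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.001)  # stand-in for one CMR request
        active -= 1

    async def process_month() -> None:
        day_slots = asyncio.Semaphore(day_limit)  # per-month day limit

        async def limited_day() -> None:
            async with day_slots:
                await cache_day()

        async with month_slots:  # at most `month_limit` months at once
            await asyncio.gather(*(limited_day() for _ in range(days_per_month)))

    await asyncio.gather(*(process_month() for _ in range(n_months)))
    return peak


peak = asyncio.run(simulate_backfill(n_months=6, days_per_month=30))
print(f"peak concurrent requests: {peak}")
```

Whatever the number of months queued, the peak never exceeds `month_limit * day_limit`, which is the ~12 figure quoted above.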

Manual Single-Month Processing

For ad-hoc processing of individual months, you can use either the monthly Step Functions workflow or invoke the write-monthly Lambda directly.

Option 1: Via Step Functions (Recommended)

Use this to run both cache-daily and write-monthly for a single month:

# Process a specific month using Step Functions
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
  --output text)

# This will cache all days in November 2024 and write the monthly file
aws stepfunctions start-execution \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --name "manual-2024-11-$(date +%Y%m%d-%H%M%S)" \
  --input '{"collection": "HLSL30", "yearmonth": "2024-11-01"}'

Option 2: Direct write-monthly Lambda Invocation

Use this only when cache-daily is already complete and you just need to write the parquet file:

# Get the write-monthly Lambda function name
WRITE_MONTHLY_FUNCTION=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`WriteMonthlyFunctionName`].OutputValue' \
  --output text)

# Invoke for a specific month
aws lambda invoke \
  --function-name "$WRITE_MONTHLY_FUNCTION" \
  --payload '{"collection": "HLSL30", "yearmonth": "2024-11-01"}' \
  response.json && cat response.json

# View logs
aws logs tail "/aws/lambda/$WRITE_MONTHLY_FUNCTION" --follow

Available Parameters:

collection (required): "HLSL30" or "HLSS30"
yearmonth (required): Format "YYYY-MM-DD" (e.g., "2024-01-01"); the day is ignored
require_complete_links (optional): Boolean (default: true). Require all daily cache files before processing
skip_existing (optional): Boolean (default: true). Skip if the output file already exists
batch_size (optional): Number of items per batch (default: 1000)

Note: dest and version are configured at deployment time via environment variables.
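Before invoking the Lambda, it can help to validate a payload locally against the parameters listed above. The dataclass below is a sketch of such a check, not the handler's actual schema. `WriteMonthlyPayload` and `parse_payload` are hypothetical names, and rejecting non-positive batch sizes is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

VALID_COLLECTIONS = ("HLSL30", "HLSS30")


@dataclass(frozen=True)
class WriteMonthlyPayload:
    collection: str
    yearmonth: str
    require_complete_links: bool = True
    skip_existing: bool = True
    batch_size: int = 1000

    def __post_init__(self) -> None:
        if self.collection not in VALID_COLLECTIONS:
            raise ValueError(f"collection must be one of {VALID_COLLECTIONS}")
        date.fromisoformat(self.yearmonth)  # raises ValueError if not YYYY-MM-DD
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")

    @property
    def year_month(self) -> Tuple[int, int]:
        """(year, month) of the target month; the day is ignored."""
        parsed = date.fromisoformat(self.yearmonth)
        return parsed.year, parsed.month


def parse_payload(event: dict) -> WriteMonthlyPayload:
    """Build a validated payload; unknown keys raise TypeError."""
    return WriteMonthlyPayload(**event)


payload = parse_payload({"collection": "HLSL30", "yearmonth": "2024-11-01"})
print(payload)
```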

Monitoring

Step Functions Workflow

View execution status and logs:

# Get state machine ARN
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
  --output text)

# List recent executions
aws stepfunctions list-executions \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --max-results 10

# Get details of specific execution
aws stepfunctions describe-execution \
  --execution-arn <execution-arn>

# View execution history (see each step)
aws stepfunctions get-execution-history \
  --execution-arn <execution-arn> \
  --max-results 100

# View Step Functions logs
aws logs tail /aws/vendedlogs/states/hls-monthly-workflow --follow

Lambda Functions

View Lambda logs:

# Cache-daily Lambda
CACHE_DAILY_FUNCTION=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`LambdaFunctionName`].OutputValue' \
  --output text)
aws logs tail "/aws/lambda/$CACHE_DAILY_FUNCTION" --follow

# Write-monthly Lambda
WRITE_MONTHLY_FUNCTION=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`WriteMonthlyFunctionName`].OutputValue' \
  --output text)
aws logs tail "/aws/lambda/$WRITE_MONTHLY_FUNCTION" --follow

# Check for errors in the last hour (date -d requires GNU date)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/$CACHE_DAILY_FUNCTION" \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '1 hour ago' +%s)000"

SNS Notifications

The SNS alert topic receives two types of messages:

Step Functions notifications: The monthly workflow publishes a success message (with collection, month, and record count) on completion, and a failure message (with error details) if the write-monthly step fails.
CloudWatch alarm notifications: Alarms for cache-daily Lambda errors and Step Functions timeouts also publish to the same topic.

Subscribe your email to receive all notifications:

# Get alert topic ARN
ALERT_TOPIC_ARN=$(aws cloudformation describe-stacks \
  --stack-name HlsStacGeoparquetArchive \
  --query 'Stacks[0].Outputs[?OutputKey==`AlertTopicArn`].OutputValue' \
  --output text)

# Subscribe your email
aws sns subscribe \
  --topic-arn "$ALERT_TOPIC_ARN" \
  --protocol email \
  --notification-endpoint your-email@example.com

# View CloudWatch alarm status
aws cloudwatch describe-alarms --alarm-name-prefix HlsStacGeoparquetArchive

Cleanup

cd infrastructure
npx cdk destroy

Acknowledgments

NASA's CMR API for providing access to HLS data
The rustac library for efficient STAC GeoParquet writing
The obstore library for high-performance object storage access
