Query NASA's CMR for HLS (Harmonized Landsat Sentinel-2) satellite data and cache STAC items as GeoParquet files. Supports both local processing and AWS Lambda + Step Functions deployment.
The AWS Step Functions + Lambda pipeline writes a hive-partitioned parquet dataset following this pattern:
s3://{bucket}/{prefix}/{version}/{collection}/year={year}/month={month}/*.parquet.
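As a rough illustration of the partition pattern above, the sketch below builds the directory for one month of one collection. The function name `partition_prefix` is hypothetical and not part of this package, and the two-digit zero-padding of the month is an assumption.

```python
def partition_prefix(bucket, prefix, version, collection, year, month):
    """Return the hive-style partition directory for one collection-month.

    Hypothetical helper for illustration only; not part of hls-stac-parquet.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: the month partition value is zero-padded to two digits.
    return (
        f"s3://{bucket}/{prefix}/{version}/{collection}/"
        f"year={year}/month={month:02d}/"
    )


print(partition_prefix("bucket", "data", "v2", "HLSL30.v2.0", 2024, 1))
```

Query engines that understand hive partitioning can then treat `year` and `month` as columns and prune partitions when filtering on them.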
git clone https://github.com/MAAP-project/hls-stac-parquet.git
cd hls-stac-parquet
uv sync
Two-step workflow for efficient data processing:
Query CMR and cache STAC JSON links for a specific day and collection:
uv run hls-stac-parquet cache-daily-stac-json-links HLSL30 2024-01-15 s3://bucket/data
# Optional: filter by bounding box (west, south, east, north)
uv run hls-stac-parquet cache-daily-stac-json-links HLSS30 2024-01-15 s3://bucket/data \
--bounding-box -100,40,-90,50
Read cached links and write monthly GeoParquet files:
uv run hls-stac-parquet write-monthly-stac-geoparquet HLSL30 2024-01 s3://bucket/data
# Optional: version output and control validation
uv run hls-stac-parquet write-monthly-stac-geoparquet HLSS30 2024-01 s3://bucket/data \
--version v0.1.0 \
--no-require-complete-links
- HLSL30: HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0
- HLSS30: HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0
s3://bucket/data/
├── links/
│   ├── HLSL30.v2.0/2024/01/2024-01-01.json
│   ├── HLSL30.v2.0/2024/01/2024-01-02.json
│   └── ...
└── v2/
    └── HLSL30.v2.0/year=2024/month=01/HLSL30_2.0-2024-1.parquet
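The daily link files in the layout above follow a predictable naming scheme, so checking whether a month is fully cached comes down to comparing expected keys against what exists. The sketch below assumes the `links/{collection}/{YYYY}/{MM}/{YYYY-MM-DD}.json` pattern shown in the tree; both function names are hypothetical, and the package's own completeness check may work differently.

```python
import calendar
from datetime import date


def expected_link_keys(collection_id, year, month):
    """List the daily link keys one month should contain (illustrative only)."""
    n_days = calendar.monthrange(year, month)[1]
    return [
        f"links/{collection_id}/{year}/{month:02d}/"
        f"{date(year, month, day).isoformat()}.json"
        for day in range(1, n_days + 1)
    ]


def missing_link_keys(collection_id, year, month, existing_keys):
    """Return the expected keys that are absent from existing_keys."""
    existing = set(existing_keys)
    return [
        key
        for key in expected_link_keys(collection_id, year, month)
        if key not in existing
    ]


keys = expected_link_keys("HLSL30.v2.0", 2024, 1)
print(len(keys), keys[0])
```

An empty result from `missing_link_keys` would correspond to the situation in which writing the monthly file with complete links required can proceed.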
Deploy scalable processing infrastructure with AWS CDK:
Serverless Lambda + Step Functions:
- Cache Daily Lambda: Lightweight CMR queries (1024 MB memory, 300s timeout, max 4 concurrent)
- Write Monthly Lambda: Write monthly GeoParquet files (8192 MB memory, 15min timeout, no concurrency limit)
- Month Calculator Lambda: Generate dates array for Step Functions (128 MB memory, 30s timeout)
- Month List Generator Lambda: Generate month list for backfill workflow (128 MB memory, 30s timeout)
- Monthly Workflow State Machine: Orchestrates single month processing (cache-daily → write-monthly)
- Backfill Workflow State Machine: Orchestrates multi-month historical backfill (max 3 months in parallel)
- EventBridge Rules: Automated trigger every 5 days for both previous-month catch-up and current-month incremental updates (enabled by default)
- CloudWatch Alarms: Monitor Lambda errors and Step Functions timeouts, publishing to the SNS alert topic
- Storage: S3 bucket for cached STAC links
- Logging: CloudWatch logs for all Lambda functions and Step Functions executions
cd infrastructure
npm install && npm run build
npm run deploy
The Step Functions state machine runs automatically every 5 days to keep the archive current. Each trigger fires four executions (staggered by one hour):
- 10:00 UTC — HLSL30 previous month (straggler catch-up)
- 11:00 UTC — HLSS30 previous month (straggler catch-up)
- 12:00 UTC — HLSL30 current month (incremental build)
- 13:00 UTC — HLSS30 current month (incremental build)
Each execution:
- Calculates the target month's date range
- Caches STAC links for all days in that month (parallel, max 4 concurrent)
- Writes the monthly GeoParquet file
Input Parameters:
- collection (required): Either "HLSL30" or "HLSS30"
- yearmonth (optional): Specific month to process, in the format "YYYY-MM-DD" (the day is ignored). If not provided, the previous month is processed
Note: dest and version are configured at deployment time and cannot be overridden at runtime.
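The input rules above (a required collection, an optional `yearmonth` whose day is ignored, and a previous-month default) can be sketched as follows. This is a minimal illustration rather than the deployed Lambda code; `resolve_target_month` and its `today` argument are hypothetical names.

```python
from datetime import date

VALID_COLLECTIONS = {"HLSL30", "HLSS30"}


def resolve_target_month(event, today):
    """Return (collection, year, month) for a workflow input dict.

    Illustrative sketch only; the deployed Lambda may validate differently.
    """
    collection = event.get("collection")
    if collection not in VALID_COLLECTIONS:
        raise ValueError(f"collection must be one of {sorted(VALID_COLLECTIONS)}")

    yearmonth = event.get("yearmonth")
    if yearmonth:
        # The day component is parsed but deliberately ignored.
        parsed = date.fromisoformat(yearmonth)
        return collection, parsed.year, parsed.month

    # Default: the month before the one containing `today`.
    if today.month == 1:
        return collection, today.year - 1, 12
    return collection, today.year, today.month - 1


print(resolve_target_month({"collection": "HLSL30"}, date(2025, 1, 10)))
```

Passing `today` explicitly keeps the default branch easy to exercise, including the January case that rolls back to December of the previous year.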
Manual Invocation:
# Get the state machine ARN
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
--output text)
# Start execution - process previous month (default behavior)
aws stepfunctions start-execution \
--state-machine-arn "$STATE_MACHINE_ARN" \
--name "manual-hlsl30-$(date +%Y%m%d-%H%M%S)" \
--input '{"collection": "HLSL30"}'
# Start execution - process a specific month
aws stepfunctions start-execution \
--state-machine-arn "$STATE_MACHINE_ARN" \
--name "manual-hlsl30-2024-11-$(date +%Y%m%d-%H%M%S)" \
--input '{"collection": "HLSL30", "yearmonth": "2024-11-01"}'
# Monitor execution
EXECUTION_ARN=$(aws stepfunctions list-executions \
--state-machine-arn "$STATE_MACHINE_ARN" \
--max-results 1 \
--query 'executions[0].executionArn' \
--output text)
aws stepfunctions describe-execution --execution-arn "$EXECUTION_ARN"
WARNING: The backfill workflow will query NASA's CMR API very heavily. It processes multiple months in parallel, with each month making ~30 CMR requests. Use responsibly and consider rate limiting for large historical backfills.
The backfill workflow is a parent Step Functions state machine that orchestrates the complete historical rebuild:
- Generate month list: Calculate all year-months to process for a date range
- Process months in parallel: Invoke the monthly workflow for each month (max 3 concurrent)
- Each monthly workflow: Cache all days (max 4 concurrent) → Write monthly GeoParquet
Advantages over manual batch processing:
- Infrastructure-managed, no long-running scripts
- Built-in concurrency control to protect upstream API
- Automatic retries and error handling
- Progress tracking in Step Functions console
- Can process entire collection history in one command
Input Parameters:
- collection (required): Either "HLSL30" or "HLSS30"
- start_date (optional): ISO format date (YYYY-MM-DD). Defaults to the collection origin date (HLSL30: 2013-04-01, HLSS30: 2015-11-01)
- end_date (optional): ISO format date (YYYY-MM-DD). Defaults to the last complete month
Note: dest and version are configured at deployment time and cannot be overridden at runtime.
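To make the defaults above concrete, here is a minimal sketch of how a month list could be generated for a backfill. It is not the Month List Generator Lambda itself: the function names are hypothetical, and treating the month containing `end_date` as inclusive is an assumption.

```python
from datetime import date

# Origin dates as documented in the input parameters above.
COLLECTION_ORIGINS = {
    "HLSL30": date(2013, 4, 1),
    "HLSS30": date(2015, 11, 1),
}


def last_complete_month(today):
    """First day of the month preceding the one that contains `today`."""
    if today.month == 1:
        return date(today.year - 1, 12, 1)
    return date(today.year, today.month - 1, 1)


def month_list(collection, start_date=None, end_date=None, today=None):
    """Return 'YYYY-MM' strings from start to end (end month inclusive)."""
    today = today or date.today()
    start = date.fromisoformat(start_date) if start_date else COLLECTION_ORIGINS[collection]
    end = date.fromisoformat(end_date) if end_date else last_complete_month(today)

    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


print(len(month_list("HLSL30", "2020-01-01", "2024-12-01")))
```

Under these assumptions, a 2020-01-01 to 2024-12-01 range expands to 60 monthly workflow invocations.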
Running a Backfill:
# Get the backfill state machine ARN
BACKFILL_STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`BackfillStateMachineArn`].OutputValue' \
--output text)
# Backfill entire HLSL30 history (2013-04 to present)
# WARNING: This will make ~50,000 CMR API requests over several hours
aws stepfunctions start-execution \
--state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
--name "backfill-hlsl30-full-$(date +%Y%m%d-%H%M%S)" \
--input '{"collection": "HLSL30"}'
# Backfill specific date range (2020-2024)
aws stepfunctions start-execution \
--state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
--name "backfill-hlsl30-2020s-$(date +%Y%m%d-%H%M%S)" \
--input '{
"collection": "HLSL30",
"start_date": "2020-01-01",
"end_date": "2024-12-01"
}'
# Monitor execution progress
EXECUTION_ARN=$(aws stepfunctions list-executions \
--state-machine-arn "$BACKFILL_STATE_MACHINE_ARN" \
--max-results 1 \
--query 'executions[0].executionArn' \
--output text)
aws stepfunctions describe-execution --execution-arn "$EXECUTION_ARN"
# View backfill workflow logs
aws logs tail /aws/vendedlogs/states/hls-backfill-workflow --follow
Concurrency Settings:
- Backfill workflow: Processes 3 months concurrently
- Monthly workflow: Processes 4 days concurrently per month
- Total concurrent CMR requests: ~12 (3 months × 4 days)
- write-monthly Lambda: No concurrency limit (processes as many months as needed)
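The two nested limits above bound the number of simultaneous cache-daily tasks at 3 × 4 = 12. The sketch below imitates that shape locally with thread pools and records the peak number of in-flight tasks. It is only an analogy for the Step Functions concurrency settings: all names are hypothetical, and the `time.sleep` call stands in for a CMR query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_MONTHS = 3  # backfill workflow limit
MAX_CONCURRENT_DAYS = 4    # monthly workflow limit


class PeakCounter:
    """Thread-safe counter that remembers the highest number of active tasks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self._active -= 1
        return False


def run_backfill(months, days_per_month=28):
    """Run a simulated backfill and return (results, peak_concurrency)."""
    counter = PeakCounter()

    def cache_day(day):
        with counter:
            time.sleep(0.005)  # placeholder for one daily CMR query
        return day

    def process_month(month):
        # Inner pool: at most MAX_CONCURRENT_DAYS days in flight per month.
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_DAYS) as pool:
            cached = list(pool.map(cache_day, range(1, days_per_month + 1)))
        return month, len(cached)

    # Outer pool: at most MAX_CONCURRENT_MONTHS months in flight.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_MONTHS) as pool:
        results = list(pool.map(process_month, months))
    return results, counter.peak


results, peak = run_backfill(["2024-01", "2024-02", "2024-03", "2024-04"])
print(results, peak)
```

However many months are queued, the recorded peak never exceeds the product of the two limits, which is the property that protects the upstream API.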
For ad-hoc processing of individual months, you can use either the monthly Step Functions workflow or invoke the write-monthly Lambda directly.
Option 1: Via Step Functions (Recommended)
Use this to process a single month including cache-daily and write-monthly:
# Process a specific month using Step Functions
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
--output text)
# This will cache all days in November 2024 and write the monthly file
aws stepfunctions start-execution \
--state-machine-arn "$STATE_MACHINE_ARN" \
--name "manual-2024-11-$(date +%Y%m%d-%H%M%S)" \
--input '{"collection": "HLSL30", "yearmonth": "2024-11-01"}'
Option 2: Direct write-monthly Lambda Invocation
Use this only when cache-daily is already complete and you just need to write the parquet file:
# Get the write-monthly Lambda function name
WRITE_MONTHLY_FUNCTION=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`WriteMonthlyFunctionName`].OutputValue' \
--output text)
# Invoke for a specific month
aws lambda invoke \
--function-name "$WRITE_MONTHLY_FUNCTION" \
--payload '{"collection": "HLSL30", "yearmonth": "2024-11-01"}' \
response.json && cat response.json
# View logs
aws logs tail "/aws/lambda/$WRITE_MONTHLY_FUNCTION" --follow
Available Parameters:
- collection: "HLSL30" or "HLSS30" (required)
- yearmonth: Format "YYYY-MM-DD" (e.g., "2024-01-01"); the day is ignored (required)
- require_complete_links: Optional. Boolean (default: true). Require all daily cache files before processing
- skip_existing: Optional. Boolean (default: true). Skip if the output file already exists
- batch_size: Optional. Number of items per batch (default: 1000)
Note: dest and version are configured at deployment time via environment variables.
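When scripting direct invocations, it can help to assemble and validate the payload before sending it. The sketch below builds the JSON string from the parameters listed above, using their documented defaults. `build_write_monthly_payload` is a hypothetical helper, and its validation rules are assumptions rather than the Lambda's own checks.

```python
import json
from datetime import date


def build_write_monthly_payload(
    collection,
    yearmonth,
    require_complete_links=True,
    skip_existing=True,
    batch_size=1000,
):
    """Return a JSON payload string for the write-monthly Lambda (illustrative)."""
    if collection not in ("HLSL30", "HLSS30"):
        raise ValueError("collection must be 'HLSL30' or 'HLSS30'")
    # Raises ValueError if yearmonth is not a valid ISO date string.
    date.fromisoformat(yearmonth)
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return json.dumps(
        {
            "collection": collection,
            "yearmonth": yearmonth,
            "require_complete_links": require_complete_links,
            "skip_existing": skip_existing,
            "batch_size": batch_size,
        }
    )


print(build_write_monthly_payload("HLSL30", "2024-11-01", skip_existing=False))
```

The resulting string can be passed to `aws lambda invoke --payload`. Depending on the AWS CLI version and configuration, raw JSON payloads may also require `--cli-binary-format raw-in-base64-out`.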
View execution status and logs:
# Get state machine ARN
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`MonthlyWorkflowStateMachineArn`].OutputValue' \
--output text)
# List recent executions
aws stepfunctions list-executions \
--state-machine-arn "$STATE_MACHINE_ARN" \
--max-results 10
# Get details of specific execution
aws stepfunctions describe-execution \
--execution-arn <execution-arn>
# View execution history (see each step)
aws stepfunctions get-execution-history \
--execution-arn <execution-arn> \
--max-results 100
# View Step Functions logs
aws logs tail /aws/vendedlogs/states/hls-monthly-workflow --follow
View Lambda logs:
# Cache-daily Lambda
CACHE_DAILY_FUNCTION=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`LambdaFunctionName`].OutputValue' \
--output text)
aws logs tail "/aws/lambda/$CACHE_DAILY_FUNCTION" --follow
# Write-monthly Lambda
WRITE_MONTHLY_FUNCTION=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`WriteMonthlyFunctionName`].OutputValue' \
--output text)
aws logs tail "/aws/lambda/$WRITE_MONTHLY_FUNCTION" --follow
# Check for errors
# Times are epoch milliseconds; `date -d` is GNU date syntax
aws logs filter-log-events \
--log-group-name "/aws/lambda/$CACHE_DAILY_FUNCTION" \
--filter-pattern "ERROR" \
--start-time "$(date -d '1 hour ago' +%s)000"
The SNS alert topic receives two types of messages:
- Step Functions notifications: The monthly workflow publishes a success message (with collection, month, and record count) on completion, and a failure message (with error details) if the write-monthly step fails.
- CloudWatch alarm notifications: Alarms for cache-daily Lambda errors and Step Functions timeouts also publish to the same topic.
Subscribe your email to receive all notifications:
# Get alert topic ARN
ALERT_TOPIC_ARN=$(aws cloudformation describe-stacks \
--stack-name HlsStacGeoparquetArchive \
--query 'Stacks[0].Outputs[?OutputKey==`AlertTopicArn`].OutputValue' \
--output text)
# Subscribe your email
aws sns subscribe \
--topic-arn "$ALERT_TOPIC_ARN" \
--protocol email \
--notification-endpoint your-email@example.com
# View CloudWatch alarm status
aws cloudwatch describe-alarms --alarm-name-prefix HlsStacGeoparquetArchive
cd infrastructure
npx cdk destroy
- NASA's CMR API for providing access to HLS data
- The rustac library for efficient STAC GeoParquet writing
- The obstore library for high-performance object storage access