Receive: queries blocked during startup while uploading blocks to object storage #8776

@dreiggy

What happened

We're running Thanos Receive with object storage enabled, and after a restart (or a crash recovery) it can take a long time before Query can actually reach it. During that window, the receive instance is marked as not ready and Query just skips it entirely — so we get gaps in query results.

The problem gets worse the more tenants and unuploaded blocks you have sitting on disk. In our case it can be several hours of downtime after every pod restart.

What I found analyzing the code with AI:

There are actually two places where block upload gates readiness:

1) Initial sync before the run group even starts

In cmd/thanos/receive.go around line 762:

level.Info(logger).Log("msg", "upload enabled, starting initial sync")
if err := upload(context.Background()); err != nil {
    return errors.Wrap(err, "initial upload failed")
}

This calls SyncAllTenants() synchronously. If there are a lot of blocks to upload (e.g. from a previous unclean shutdown), startup blocks until every one of them has been uploaded. It's also fatal: if the upload fails, receive doesn't start at all.

2) First hashring load triggers flush + upload + wait

When the first hashring config arrives (which always happens on startup), the code does:

statusProber.NotReady(...)
dbs.Flush()           // compacts all heads into blocks
dbs.Open()            // reopens TSDBs
uploadC <- struct{}{} // trigger upload
<-uploadDone          // wait for it
statusProber.Ready()  // only now we're ready

So receive won't set itself ready until all freshly-flushed blocks are uploaded too. The gRPC health check returns NOT_SERVING, and the Info API returns an error, so Query drops this endpoint from its set.

The thing is — the data is still there locally in TSDB. Receive could serve queries from local storage just fine while uploading in the background. The upload is about durability (getting blocks to object storage), not about being able to answer queries.
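To illustrate the ordering I have in mind for the hashring path, here is a rough sketch, not the actual Thanos code. The store and prober interfaces, the recorder type, and the buffered uploadC channel are all hypothetical stand-ins I made up for the example; the real dbs and statusProber types may have different signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// store and prober are hypothetical stand-ins, not the real Thanos types.
type store interface {
	Flush() error
	Open() error
}

type prober interface {
	NotReady(err error)
	Ready()
}

// onHashringChange flushes and reopens local storage, marks the instance
// ready, and only then signals the uploader without waiting for it.
func onHashringChange(s store, p prober, uploadC chan<- struct{}) error {
	p.NotReady(errors.New("hashring has changed; flushing"))
	if err := s.Flush(); err != nil {
		return fmt.Errorf("flush: %w", err)
	}
	if err := s.Open(); err != nil {
		return fmt.Errorf("open: %w", err)
	}
	p.Ready() // local data is queryable from here on

	// Non-blocking trigger: if an upload signal is already pending, skip it.
	select {
	case uploadC <- struct{}{}:
	default:
	}
	return nil
}

// recorder implements both interfaces and records the call order.
type recorder struct{ events []string }

func (r *recorder) Flush() error   { r.events = append(r.events, "flush"); return nil }
func (r *recorder) Open() error    { r.events = append(r.events, "open"); return nil }
func (r *recorder) NotReady(error) { r.events = append(r.events, "not-ready") }
func (r *recorder) Ready()         { r.events = append(r.events, "ready") }

func main() {
	rec := &recorder{}
	uploadC := make(chan struct{}, 1) // assumed buffered so the trigger never blocks
	if err := onHashringChange(rec, rec, uploadC); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.events, "pending upload triggers:", len(uploadC))
}
```

The only difference from the current sequence is that Ready() is called before the upload is triggered, and nothing waits on uploadDone.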

What I'd like to see changed

The general idea: decouple block upload from readiness. Receive should become ready to serve queries as soon as the TSDBs are opened, and upload should happen in the background.

Concretely:

  1. Make the initial sync non-blocking (and non-fatal). Move it into the run group as a background task, or at least don't gate startup on it. If some blocks fail to upload on first try, the periodic uploader (every 30s) will pick them up anyway.

  2. Don't wait for upload to complete on hashring change. After dbs.Flush() + dbs.Open(), set Ready immediately and let the upload happen asynchronously. The data is on local disk, queries work fine.

  3. Consider making the initial sync failure non-fatal. Right now if the initial SyncAllTenants() fails, receive refuses to start. That seems overly strict — a transient object storage issue shouldn't prevent the instance from serving traffic.

An alternative (smaller change): add a flag like --receive.upload-on-startup=false that lets operators skip the synchronous initial upload entirely, relying on the periodic uploader instead.

Environment

  • Thanos version: 0.42.0-dev (main branch)
  • Object storage: S3 (but not specific to any backend)
  • Multiple tenants, ~hundreds of blocks on disk after unclean restart

Relevant code

  • cmd/thanos/receive.go: startTSDBAndUpload() function, initial sync around L762, hashring change handler around L717
  • pkg/receive/multitsdb.go: SyncAllTenants(), startPeriodicUploader()
  • pkg/shipper/shipper.go: Sync() method
  • pkg/prober/grpc.go: NotReady() sets gRPC health to NOT_SERVING
