Receive: queries blocked during startup while uploading blocks to object storage

## What happened

We're running Thanos Receive with object storage enabled, and after a restart (or a crash recovery) it can take a long time before Query can actually reach it. During that window, the receive instance is marked as not ready and Query just skips it entirely — so we get gaps in query results.

The problem gets worse the more tenants and unuploaded blocks you have sitting on disk. In our case it can be several hours of downtime after every pod restart.

## What I found analyzing the code with AI

There are actually two places where the upload blocks readiness:

### 1) Initial sync before the run group even starts

In `cmd/thanos/receive.go` around line 762:

```go
level.Info(logger).Log("msg", "upload enabled, starting initial sync")
if err := upload(context.Background()); err != nil {
    return errors.Wrap(err, "initial upload failed")
}
```

This calls `SyncAllTenants()` synchronously. If there are a lot of blocks to upload (e.g. from a previous unclean shutdown), this can take hours. It's also fatal: if it fails, receive doesn't start at all.

### 2) First hashring load triggers flush + upload + wait

When the first hashring config arrives (which always happens on startup), the code does:

```go
statusProber.NotReady(...)
dbs.Flush()           // compacts all heads into blocks
dbs.Open()            // reopens TSDBs
uploadC <- struct{}{} // trigger upload
<-uploadDone          // wait for it
statusProber.Ready()  // only now we're ready
```

So receive won't set itself ready until all freshly-flushed blocks are uploaded too. The gRPC health check returns `NOT_SERVING`, and the Info API returns an error, so Query drops this endpoint from its set.

The thing is, the data is still there locally in TSDB. Receive could serve queries from local storage just fine while uploading in the background. The upload is about durability (getting blocks to object storage), not about being able to answer queries.

## What I'd like to see changed

The general idea: **decouple block upload from readiness**. Receive should become ready to serve queries as soon as the TSDBs are opened, and upload should happen in the background.

Concretely:

1. **Make the initial sync non-blocking.** Move it into the run group as a background task, or at least don't gate startup on it. If some blocks fail to upload on the first try, the periodic uploader (every 30s) will pick them up anyway.

2. **Don't wait for upload to complete on hashring change.** After `dbs.Flush()` + `dbs.Open()`, set Ready immediately and let the upload happen asynchronously. The data is on local disk, queries work fine.

3. **Consider making the initial sync failure non-fatal.** Right now if the initial `SyncAllTenants()` fails, receive refuses to start. That seems overly strict — a transient object storage issue shouldn't prevent the instance from serving traffic.

An alternative (smaller change): add a flag like `--receive.upload-on-startup=false` that lets operators skip the synchronous initial upload entirely, relying on the periodic uploader instead.

## Environment

- Thanos version: 0.42.0-dev (main branch)
- Object storage: S3 (but not specific to any backend)
- Multiple tenants, ~hundreds of blocks on disk after unclean restart

## Relevant code

- `cmd/thanos/receive.go` — `startTSDBAndUpload()` function, initial sync around L762, hashring change handler around L717
- `pkg/receive/multitsdb.go` — `SyncAllTenants()`, `startPeriodicUploader()`
- `pkg/shipper/shipper.go` — `Sync()` method
- `pkg/prober/grpc.go` — `NotReady()` sets gRPC health to `NOT_SERVING`

