/resolved endpoint is the primary scaling bottleneck #482

@thehabes

Summary

The `GET /project/{id}/page/{pageId}/resolved` endpoint is the single largest performance bottleneck in the TPEN3 stack. It triggers a deep dependency chain across all three services and degrades rapidly under concurrent load. RERUM fails under the amplified load, cascading 502s through the stack.

How it works

When called, `/resolved` fetches the page data, then makes N parallel RERUM requests (one per page item/annotation) to resolve each annotation's full content. For a page with 700 items, that's 700 outgoing HTTP requests fanning out through TinyPEN to RERUM.
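
To make the fan-out concrete, here is a minimal TypeScript sketch of the pattern described above. It is not the actual TPEN3 handler: `Page`, `PageItem`, `resolveNaively`, and the injected `fetchJson` are hypothetical names, and it assumes each item id is a URL that is ultimately served by RERUM.

```typescript
// Hypothetical shapes; the real TPEN3 page and item types may differ.
interface PageItem {
  id: string; // assumed to be the annotation's URL
}

interface Page {
  id: string;
  items: PageItem[];
}

// Injected so the sketch does not depend on a particular HTTP client.
type FetchJson = (url: string) => Promise<unknown>;

// Naive resolution: every item becomes its own outgoing request, and
// Promise.all waits on all of them after they have been started together.
// A page with 700 items therefore puts 700 requests in flight at once.
async function resolveNaively(page: Page, fetchJson: FetchJson): Promise<unknown[]> {
  return Promise.all(page.items.map((item) => fetchJson(item.id)));
}
```

With this shape, the number of simultaneous upstream requests is bounded only by the size of the page, which is why the cost grows as the test page accumulates lines.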

Evidence

From load testing (Run 4):

| Phase | VUs | `/resolved` p95 | Impact |
|---|---|---|---|
| Phase 1 (baseline) | 1 | 2.8-4.5s | Slow even at single user |
| Phase 2 (load) | 20 | ~4.7s | Drives overall read p95 above threshold |
| Phase 3 (stress) | 50+ | 14.5s+ | Causes queuing that affects ALL endpoints |
| Phase 4 (spike) | 80 | 25s+ | Queue saturation, 54s timeouts at 150 VUs |

Key observations:

  • At 80+ VUs, `/resolved` causes a cascading queue that blocks unrelated endpoints (simple GETs that would normally take <100ms start taking 5-10s)
  • The page used for testing has grown to 700+ items across test runs. Each run adds lines via load/stress/conflict phases, making `/resolved` progressively slower (Finding #15 in test plan)
  • This is the primary reason the stack can't handle 30+ concurrent users

Recommendations

In priority order:

  1. Cache resolved responses — Even a 30-second TTL cache would dramatically reduce RERUM load. Annotations rarely change between page loads.

  2. Add pagination — Return resolved items in pages (e.g., 50 at a time) so the client can progressively render rather than waiting for all 700 to resolve.

  3. Batch RERUM lookups — Instead of N individual `/id/{id}` requests, use a single query like `POST /query` with `{"@id": {"$in": [...ids]}}` to fetch many annotations at once.

  4. Limit concurrent outgoing requests — If N=700, don't fire all 700 simultaneously. Use a concurrency pool (e.g., 20 at a time) to avoid overwhelming RERUM.
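
As a rough illustration of recommendations 1 and 4 working together, the sketch below pairs a small in-memory TTL cache with a concurrency-limited mapper. All names here (`TtlCache`, `mapWithLimit`, `resolveCached`, `fetchJson`) are hypothetical and not part of TPEN3, TinyPEN, or RERUM. It assumes a single-process cache keyed by page and an injected fetch function; a real deployment might need a shared cache and invalidation on save.

```typescript
// Minimal TTL cache. The clock is injectable so expiry can be tested.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Run `worker` over `inputs` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithLimit<T, R>(
  inputs: readonly T[],
  limit: number,
  worker: (input: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(inputs.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until none remain.
  async function lane(): Promise<void> {
    while (next < inputs.length) {
      const index = next++;
      results[index] = await worker(inputs[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, inputs.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Serve a cached response while it is fresh; otherwise resolve the items
// through the pool and cache the result under the page key.
async function resolveCached(
  pageKey: string,
  itemIds: readonly string[],
  fetchJson: (id: string) => Promise<unknown>,
  cache: TtlCache<unknown[]>,
  limit = 20,
): Promise<unknown[]> {
  const cached = cache.get(pageKey);
  if (cached !== undefined) return cached;
  const resolved = await mapWithLimit(itemIds, limit, fetchJson);
  cache.set(pageKey, resolved);
  return resolved;
}
```

A caller would create one cache for the process, for example `new TtlCache<unknown[]>(30_000)`, and pass it to `resolveCached` on each request. Note that this sketch does not deduplicate concurrent misses: two requests for the same uncached page would both resolve it. It also fails the whole call if any single fetch rejects.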
