
refactor(experimental): reuse HTTP clients, add response models, and parallelize ops #1253

Merged
garrett4wade merged 1 commit into inclusionAI:main from guozhihao-224:refactor/inference-service-http-perf
Apr 27, 2026

Conversation

@guozhihao-224 (Collaborator) commented Apr 24, 2026

Description

Refactors the inference service HTTP layer to reuse long-lived httpx clients instead of creating a new client per request. Also adds Pydantic response models across all services for type safety, and parallelizes previously sequential operations (health checks, proxy registrations, broadcasts) for better throughput.

Related Issue

Fixes #1217

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

  • Controller: shared httpx.Client/AsyncClient, idempotent destroy(), parallel proxy registration via ThreadPoolExecutor, parallel set_version/pause/continue via asyncio.gather
  • Gateway: shared AsyncClient via lifespan, _use_client() context manager in streaming module, parallel data proxy registration
  • Router: shared AsyncClient, parallel health checks via asyncio.gather, proper lifespan cleanup with try/finally
  • Data proxy: Pydantic response models, shared client for non-streaming requests, parallel callback delivery, proper InfBridge cleanup on shutdown and backend reconfiguration
  • InfBridge: shared AsyncClient with aclose() lifecycle method
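The broadcast parallelization above (set_version/pause/continue via asyncio.gather) can be sketched as follows. This is a minimal stdlib-only sketch: the worker addresses are made up, and a sleeping coroutine stands in for the real httpx call.

```python
import asyncio

# Stand-in for the real per-worker HTTP call (the actual code would POST
# to each worker through a shared httpx.AsyncClient); the sleep simulates
# one network round trip.
async def set_version_on_worker(worker_addr: str, version: int) -> str:
    await asyncio.sleep(0.05)
    return f"{worker_addr}: v{version}"

async def broadcast_set_version(workers: list[str], version: int) -> list[str]:
    # asyncio.gather runs all per-worker calls concurrently, so total
    # latency is roughly one round trip instead of len(workers) round trips.
    return await asyncio.gather(
        *(set_version_on_worker(w, version) for w in workers)
    )

results = asyncio.run(broadcast_set_version(["w0", "w1", "w2"], 7))
print(results)  # → ['w0: v7', 'w1: v7', 'w2: v7']
```

asyncio.gather preserves input order in its result list, so responses can still be matched back to workers.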

Files changed:

  • areal/experimental/inference_service/controller/controller.py
  • areal/experimental/inference_service/data_proxy/app.py
  • areal/experimental/inference_service/data_proxy/pause.py
  • areal/experimental/inference_service/gateway/app.py
  • areal/experimental/inference_service/gateway/streaming.py
  • areal/experimental/inference_service/inf_bridge.py
  • areal/experimental/inference_service/router/app.py
  • tests/experimental/inference_service/test_data_proxy_chat.py

…parallelize ops in inference service

Replace per-request httpx/requests client creation with shared long-lived
clients across the inference service stack (controller, gateway, router,
data proxy, InfBridge). This eliminates repeated TCP connection setup and
TLS handshake overhead on every API call.

Key changes:
- Controller: shared httpx.Client/AsyncClient, idempotent destroy()
- Gateway: shared AsyncClient via lifespan, _use_client() helper in streaming
- Router: shared AsyncClient, parallel health checks via asyncio.gather
- Data proxy: Pydantic response models, shared client, parallel callbacks
- InfBridge: shared AsyncClient with proper aclose() lifecycle
- Parallelize: proxy registration, set_version, pause/continue broadcasts
- Add Pydantic BaseModel response types across all services for type safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@guozhihao-224 force-pushed the refactor/inference-service-http-perf branch from 4215028 to e5e9e94 on April 24, 2026, 10:46
@gemini-code-assist (Bot) left a comment


Code Review

This pull request refactors the inference service components to use shared httpx clients instead of creating new ones per request or using the requests library. It introduces Pydantic response models for better API documentation and validation across the controller, data proxy, gateway, and router. Additionally, several operations have been parallelized using asyncio.gather or ThreadPoolExecutor to improve performance, including worker health checks, callback deliveries, and version updates. Feedback includes a recommendation to avoid catching BaseException to prevent interfering with task cancellation, and suggestions to reuse shared httpx clients in contexts where they were incorrectly assumed to be thread-unsafe or incompatible with streaming.

config.request_timeout,
client=http_client,
)
except BaseException as exc:

Severity: high

Catching BaseException is generally discouraged as it includes system-level exceptions like KeyboardInterrupt and SystemExit. More importantly, in an async context, it will catch asyncio.CancelledError, which can prevent tasks from being cancelled properly and lead to unexpected behavior or hangs during shutdown. Use Exception instead.

Suggested change
except BaseException as exc:
except Exception as exc:
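The reviewer's point can be checked directly: since Python 3.8, asyncio.CancelledError derives from BaseException rather than Exception, precisely so that broad `except Exception` handlers do not swallow task cancellation. A minimal stdlib-only sketch:

```python
import asyncio

# CancelledError sits outside the Exception hierarchy (Python 3.8+), so
# `except Exception` lets cancellation propagate while `except BaseException`
# would swallow it.
print(issubclass(asyncio.CancelledError, BaseException))  # True
print(issubclass(asyncio.CancelledError, Exception))      # False

async def cancellable() -> str:
    try:
        await asyncio.sleep(10)
    except Exception:
        # Never reached on cancellation: CancelledError is not an Exception.
        return "swallowed"
    return "done"

async def main() -> bool:
    task = asyncio.create_task(cancellable())
    await asyncio.sleep(0)  # let the task start and suspend in sleep()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True          # cancellation propagated as intended
    return False

cancelled = asyncio.run(main())
print(cancelled)  # → True
```

Had `cancellable` used `except BaseException`, the task would have returned "swallowed" instead of being cancelled, which is exactly the shutdown-hang hazard the review describes.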

Comment on lines +687 to +695
# Each thread gets its own httpx.Client because httpx.Client
# is not thread-safe and must not be shared across threads.
with httpx.Client() as client:
resp = client.post(
f"{router_addr}/register",
json={"worker_addr": data_proxy_addr},
headers={"Authorization": f"Bearer {admin_key}"},
timeout=5,
)
The reason will be displayed to describe this comment to others. Learn more.

Severity: medium

The comment stating that httpx.Client is not thread-safe is incorrect. According to the httpx documentation, both Client and AsyncClient are thread-safe and intended to be shared across threads and tasks. Creating a new client per thread is unnecessary and inefficient. You should use the existing self._sync_client instance instead.

            # httpx.Client is thread-safe and can be shared across threads.
            resp = self._sync_client.post(
                f"{router_addr}/register",
                json={"worker_addr": data_proxy_addr},
                headers={"Authorization": f"Bearer {admin_key}"},
                timeout=5,
            )
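The registration fan-out the reviewer describes can be sketched with a stand-in client object; the real code would share a single httpx.Client, which the httpx docs describe as safe to share across threads. FakeClient and the router addresses here are illustrative, not from the PR.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a shared, thread-safe HTTP client such as httpx.Client.
# It just records which URLs were "posted" to, under a lock.
class FakeClient:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.posted: list[str] = []

    def post(self, url: str, json: dict, timeout: int) -> int:
        with self._lock:
            self.posted.append(url)
        return 200  # pretend HTTP status code

shared_client = FakeClient()
routers = [f"http://router-{i}" for i in range(4)]  # hypothetical addresses

def register(router_addr: str) -> int:
    # One shared client for every worker thread: no per-thread client
    # construction or connection setup.
    return shared_client.post(
        f"{router_addr}/register",
        json={"worker_addr": "http://data-proxy:8000"},
        timeout=5,
    )

with ThreadPoolExecutor(max_workers=4) as pool:
    statuses = list(pool.map(register, routers))

print(statuses)  # → [200, 200, 200, 200]
```

With a real httpx.Client the benefit is connection pooling: all threads draw from one keep-alive pool instead of each opening and tearing down its own TCP connection.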

try:
    async with httpx.AsyncClient(timeout=httpx.Timeout(timeout)) as client:
        async with client.stream(

async with httpx.AsyncClient(timeout=httpx.Timeout(timeout)) as c:

Severity: medium

The note above (lines 452-454) claiming that streaming requires a per-request client is incorrect. httpx.AsyncClient supports multiple concurrent streams using the same client instance. Reusing the shared client via the client parameter (using the _use_client helper already defined in this file) would be more efficient and consistent with the PR's objectives.

Suggested change
async with httpx.AsyncClient(timeout=httpx.Timeout(timeout)) as c:
async with _use_client(client, timeout) as c:
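A plausible shape for such a helper, sketched with a dummy client so the example stays dependency-free. The actual _use_client in streaming.py is not shown in this PR excerpt and may differ (for instance, it likely also handles the timeout); only the reuse-or-create pattern is illustrated here.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Optional

# Minimal stand-in for httpx.AsyncClient; only the async-close part
# matters for this sketch.
class DummyAsyncClient:
    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True

@asynccontextmanager
async def _use_client(
    client: Optional[DummyAsyncClient],
) -> AsyncIterator[DummyAsyncClient]:
    # Reuse the shared client when one is provided (the caller owns its
    # lifecycle); otherwise fall back to a temporary client that is
    # closed on exit.
    if client is not None:
        yield client
    else:
        temp = DummyAsyncClient()
        try:
            yield temp
        finally:
            await temp.aclose()

async def demo() -> tuple[bool, bool]:
    shared = DummyAsyncClient()
    async with _use_client(shared) as c:
        assert c is shared
    async with _use_client(None) as c:
        temp = c
    # Shared client stays open; the temporary one is closed.
    return shared.closed, temp.closed

shared_closed, temp_closed = asyncio.run(demo())
print(shared_closed, temp_closed)  # → False True
```

This shape gives call sites a single code path whether or not a long-lived client is available, which is what makes the reviewer's suggested one-line change possible.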

@garrett4wade (Collaborator) left a comment


LGTM

@garrett4wade garrett4wade merged commit 8cc52ba into inclusionAI:main Apr 27, 2026
6 checks passed
Successfully merging this pull request may close these issues:

refactor(inference_service): HTTP client reuse, parallelization, and response models