Description
Hello Triton team,
We would like to contribute an optimization we've developed for the Python backend: by speeding up a single component by roughly 30x, it achieves significant end-to-end latency reductions (~50%) for production recommendation systems. We're seeking your guidance on the best way to integrate this improvement into the project.
Is your feature request related to a problem? Please describe.
In production recommendation systems and advertising models, string inputs are ubiquitous and constitute a substantial portion of the total input data. Processing these strings efficiently is crucial for model latency and throughput.
Through profiling our production models deployed via the Python backend, we discovered that approximately 50% of nv_inference_compute_input_duration_us is consumed by string input deserialization within the Python backend (specifically the deserialize_bytes_tensor function), rather than by the model computation itself. This represents a significant optimization opportunity for large-scale production Triton server deployments serving thousands of GPUs.
Describe the solution you'd like
We have developed a C++ implementation (reference implementation) for deserialize_bytes_tensor that achieves up to 30x speedup according to our micro-benchmarks. We've validated these improvements in production environments, observing approximately 50% latency reduction overall. The solution has been running in production for several weeks with stable memory usage, which was one of our primary validation criteria.
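For illustration only (this is not our production code), a pybind11-style sketch of the approach could look like the following; the module name, function names, and error handling are placeholders. The key point is that it returns the same numpy object array of bytes that the pure-Python deserialize_bytes_tensor produces, so it can serve as a drop-in replacement:

```cpp
#include <pybind11/pybind11.h>
#include <cstdint>
#include <cstring>

namespace py = pybind11;

// Sketch: parse a serialized BYTES tensor (4-byte little-endian length
// prefix per element, as above) and return a 1-D numpy array of dtype
// object whose elements are Python bytes objects.
py::object DeserializeBytesTensor(py::bytes encoded)
{
  char* data = nullptr;
  Py_ssize_t data_len = 0;
  if (PyBytes_AsStringAndSize(encoded.ptr(), &data, &data_len) != 0) {
    throw py::error_already_set();
  }

  py::list elements;
  Py_ssize_t offset = 0;
  while (offset + static_cast<Py_ssize_t>(sizeof(uint32_t)) <= data_len) {
    uint32_t len = 0;
    std::memcpy(&len, data + offset, sizeof(len));  // assumes little-endian host
    offset += sizeof(len);
    if (offset + static_cast<Py_ssize_t>(len) > data_len) {
      throw py::value_error("truncated BYTES tensor element");
    }
    // py::bytes copies the element once; the resulting Python objects are
    // owned by the interpreter, so normal reference counting applies.
    elements.append(py::bytes(data + offset, len));
    offset += len;
  }

  // Build the object array through numpy itself, mirroring
  // np.array(strs, dtype=np.object_) in the pure-Python implementation.
  py::module_ np = py::module_::import("numpy");
  return np.attr("array")(elements, py::arg("dtype") = np.attr("object_"));
}

PYBIND11_MODULE(bytes_tensor_ext, m)  // hypothetical module name
{
  m.def("deserialize_bytes_tensor", &DeserializeBytesTensor,
        "Sketch of a native BYTES tensor deserializer.");
}
```

Because every element is handed back to the interpreter as an ordinary bytes object, lifetime management stays with Python's reference counting; this is the garbage-collection behavior we refer to under "alternatives" below.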
Results from the micro-benchmarks for string tensors of various batch sizes are shown below; production recommendation systems typically run with batch sizes of a few thousand.
[10/18] Testing: batch_1000... PASS (C++: 104μs, Python: 1249μs, Speedup: 12.01x)
[11/18] Testing: batch_5000 (size: 5000)... PASS (C++: 356μs, Python: 5859μs, Speedup: 16.46x)
[12/18] Testing: batch_15000 (size: 15000)... PASS (C++: 681μs, Python: 17438μs, Speedup: 25.61x)
[13/18] Testing: batch_15000_repeated (size: 15000)... PASS (C++: 525μs, Python: 17219μs, Speedup: 32.80x)
[14/18] Testing: batch_15000_varying (size: 15000)... PASS (C++: 2265μs, Python: 30030μs, Speedup: 13.26x)
[15/18] Testing: batch_30000 (size: 30000)... PASS (C++: 1220μs, Python: 33637μs, Speedup: 27.57x)
[16/18] Testing: batch_10000_empty (size: 10000)... PASS (C++: 214μs, Python: 10534μs, Speedup: 49.22x)
[17/18] Testing: batch_5000_large (size: 5000)... PASS (C++: 980μs, Python: 9159μs, Speedup: 9.35x)
[18/18] Testing: batch_100_mixed... PASS (C++: 44μs, Python: 385μs, Speedup: 8.75x)
Describe alternatives you've considered
We initially explored optimizing the Python implementation but were unable to achieve comparable performance gains. While the C++ implementation adds some complexity, we believe the maintenance overhead is minimal, as the primary additional consideration is ensuring proper Python garbage collection behavior. Given the substantial performance benefits, we believe this trade-off is worthwhile.
Additional context
We would greatly appreciate your feedback on this optimization and guidance on integration with the Python backend repository. We're happy to adapt our implementation to meet your coding standards and testing requirements.
Our current implementation includes:
- The C++ optimization for deserialize_bytes_tensor
- Standalone tests verifying functional parity between Python and C++ implementations (test code)
- Micro-benchmarks demonstrating performance improvements (benchmark code)
Currently, these tests are built as separate binaries. We would appreciate your guidance on properly integrating them into the repository's CI/CD pipeline should you decide to move forward with this contribution.
Thank you for considering this optimization. We look forward to working with you to improve Triton's performance for the community.
Tagging folks who recently committed to the Python backend for awareness and feedback: @mattwittwer @yinggeh @mc-nv @pskiran1 @kthui @rmccorm4