Contribution to accelerate python backend latency #8348

@wweic

Description

Hello Triton team,

We would like to contribute an optimization we've developed for the Python backend: by speeding up a single component by roughly 30x, it cuts end-to-end latency by about 50% for production recommendation systems. We're seeking your guidance on the best way to integrate this improvement into the project.

Is your feature request related to a problem? Please describe.

In production recommendation systems and advertising models, string inputs are ubiquitous and constitute a substantial portion of the total input data. Processing these strings efficiently is crucial for model latency and throughput.

Through profiling our production models deployed via the Python backend, we discovered that approximately 50% of nv_inference_compute_input_duration_us is consumed by string input deserialization within the Python backend (specifically the deserialize_bytes_tensor function), rather than by the model computation itself. This represents a significant optimization opportunity for large-scale production Triton server deployments serving thousands of GPUs.
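For context, a Triton BYTES/string tensor arrives as one flat buffer of length-prefixed records, and deserialize_bytes_tensor has to walk it element by element. A minimal C++ sketch of that layout and the walk (the helper name is ours, and the sketch skips bounds checking on malformed input):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Triton wire format for a BYTES tensor: each element is a 4-byte
// little-endian length followed by that many raw bytes, repeated
// back-to-back for every element in the tensor:
//   [len0 u32][bytes0][len1 u32][bytes1]...
std::vector<std::string> ParseBytesTensor(const uint8_t* buf, size_t size)
{
  std::vector<std::string> elements;
  size_t offset = 0;
  while (offset + sizeof(uint32_t) <= size) {
    uint32_t len = 0;
    std::memcpy(&len, buf + offset, sizeof(len));  // assumes a little-endian host
    offset += sizeof(len);
    elements.emplace_back(reinterpret_cast<const char*>(buf + offset), len);
    offset += len;
  }
  return elements;
}
```

The pure-Python implementation performs this same walk with a struct.unpack_from call per element, so the cost grows linearly with batch size at interpreter speed, which matches the profile above.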

Describe the solution you'd like

We have developed a C++ implementation (reference implementation) for deserialize_bytes_tensor that achieves up to 30x speedup according to our micro-benchmarks. We've validated these improvements in production environments, observing approximately 50% latency reduction overall. The solution has been running in production for several weeks with stable memory usage, which was one of our primary validation criteria.
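To give a feel for the approach, here is a hedged pybind11 sketch, not the reference implementation itself; the module and function names are placeholders, and the real code may, for example, fill a preallocated NumPy object array rather than going through a Python list:

```cpp
#include <cstdint>
#include <cstring>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Walk the length-prefixed buffer once in C++, materializing each element
// as a Python bytes object, then hand back a numpy array of dtype=object
// (the same return type as the existing Python deserialize_bytes_tensor).
py::object DeserializeBytesTensor(py::bytes encoded)
{
  char* data = nullptr;
  Py_ssize_t size = 0;
  if (PyBytes_AsStringAndSize(encoded.ptr(), &data, &size) != 0) {
    throw py::error_already_set();
  }

  py::list elements;
  Py_ssize_t offset = 0;
  while (offset + static_cast<Py_ssize_t>(sizeof(uint32_t)) <= size) {
    uint32_t len = 0;
    std::memcpy(&len, data + offset, sizeof(len));  // 4-byte LE length prefix
    offset += sizeof(len);
    elements.append(py::bytes(data + offset, len));
    offset += len;
  }

  // Equivalent to np.array(elements, dtype=np.object_) on the Python side.
  py::module_ np = py::module_::import("numpy");
  return np.attr("array")(elements, py::arg("dtype") = np.attr("object_"));
}

PYBIND11_MODULE(fast_bytes, m)  // module name is hypothetical
{
  m.def("deserialize_bytes_tensor", &DeserializeBytesTensor);
}
```

The gain comes from replacing per-element struct.unpack calls in the interpreter with a single C++ pass over the buffer.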

Results from the micro-benchmarks for string tensors of various batch sizes; production recommendation systems commonly run at batch sizes of a few thousand.

[10/18] Testing: batch_1000... PASS (C++: 104μs, Python: 1249μs, Speedup: 12.01x)
[11/18] Testing: batch_5000 (size: 5000)... PASS (C++: 356μs, Python: 5859μs, Speedup: 16.46x)
[12/18] Testing: batch_15000 (size: 15000)... PASS (C++: 681μs, Python: 17438μs, Speedup: 25.61x)
[13/18] Testing: batch_15000_repeated (size: 15000)... PASS (C++: 525μs, Python: 17219μs, Speedup: 32.80x)
[14/18] Testing: batch_15000_varying (size: 15000)... PASS (C++: 2265μs, Python: 30030μs, Speedup: 13.26x)
[15/18] Testing: batch_30000 (size: 30000)... PASS (C++: 1220μs, Python: 33637μs, Speedup: 27.57x)
[16/18] Testing: batch_10000_empty (size: 10000)... PASS (C++: 214μs, Python: 10534μs, Speedup: 49.22x)
[17/18] Testing: batch_5000_large (size: 5000)... PASS (C++: 980μs, Python: 9159μs, Speedup: 9.35x)
[18/18] Testing: batch_100_mixed... PASS (C++: 44μs, Python: 385μs, Speedup: 8.75x)

Describe alternatives you've considered

We initially explored optimizing the Python implementation but were unable to achieve comparable performance gains. While the C++ implementation adds some complexity, we believe the maintenance overhead is minimal: the primary additional consideration is ensuring correct Python garbage collection (reference counting) behavior. Given the substantial performance benefits, we consider this trade-off worthwhile.
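To make the garbage-collection point concrete, the ownership rule at stake is roughly the following (an illustration in pybind11 terms, not an excerpt from our patch):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Every element handed to Python must carry exactly one strong reference,
// owned by the output container. py::bytes is an RAII wrapper, so the
// temporary's reference is released at the end of each statement while
// append() keeps the list's own; with the raw CPython API, a missed
// Py_DECREF on an error path would instead show up as slow, unbounded
// memory growth under load, the failure mode our multi-week production
// run was meant to rule out.
py::list BuildElements(const char* data, const std::vector<uint32_t>& lens)
{
  py::list out;
  size_t offset = 0;
  for (uint32_t len : lens) {
    out.append(py::bytes(data + offset, len));
    offset += len;
  }
  return out;
}
```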

Additional context

We would greatly appreciate your feedback on this optimization and guidance on integration with the Python backend repository. We're happy to adapt our implementation to meet your coding standards and testing requirements.

Our current implementation includes:

  1. The C++ optimization for deserialize_bytes_tensor
  2. Standalone tests verifying functional parity between Python and C++ implementations (test code)
  3. Micro-benchmarks demonstrating performance improvements (benchmark code)

Currently, these tests are built as separate binaries. We would appreciate your guidance on properly integrating them into the repository's CI/CD pipeline should you decide to move forward with this contribution.

Thank you for considering this optimization. We look forward to working with you to improve Triton's performance for the community.

Tagging folks who recently committed to the Python backend for awareness and feedback: @mattwittwer @yinggeh @mc-nv @pskiran1 @kthui @rmccorm4
