Description
Hello Triton team,
We would like to contribute an optimization we've developed for the Python backend: by speeding up a single component by roughly 30x, it achieves significant end-to-end latency reductions (~50%) for production recommendation systems. We're seeking your guidance on the best way to integrate this improvement into the project.
Is your feature request related to a problem? Please describe.
In production recommendation systems and advertising models, string inputs are ubiquitous and constitute a substantial portion of the total input data. Processing these strings efficiently is crucial for model latency and throughput.
Through profiling our production models deployed via the Python backend, we discovered that approximately 50% of nv_inference_compute_input_duration_us is consumed by string input deserialization within the Python backend (specifically the deserialize_bytes_tensor function), rather than by the model computation itself. This represents a significant optimization opportunity for large-scale production Triton server deployments serving thousands of GPUs.
Describe the solution you'd like
We have developed a C++ implementation (reference implementation) for deserialize_bytes_tensor that achieves up to 30x speedup according to our micro-benchmarks. We've validated these improvements in production environments, observing approximately 50% latency reduction overall. The solution has been running in production for several weeks with stable memory usage, which was one of our primary validation criteria.
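For illustration only (this is not our production code), a pybind11-style sketch of the approach could look like the following; the module name, function names, and error handling are placeholders. The key point is that it returns the same numpy object array of bytes that the pure-Python deserialize_bytes_tensor produces, so it can serve as a drop-in replacement:

```cpp
#include <pybind11/pybind11.h>
#include <cstdint>
#include <cstring>

namespace py = pybind11;

// Sketch: parse a serialized BYTES tensor (4-byte little-endian length
// prefix per element, as above) and return a 1-D numpy array of dtype
// object whose elements are Python bytes objects.
py::object DeserializeBytesTensor(py::bytes encoded)
{
  char* data = nullptr;
  Py_ssize_t data_len = 0;
  if (PyBytes_AsStringAndSize(encoded.ptr(), &data, &data_len) != 0) {
    throw py::error_already_set();
  }

  py::list elements;
  Py_ssize_t offset = 0;
  while (offset + static_cast<Py_ssize_t>(sizeof(uint32_t)) <= data_len) {
    uint32_t len = 0;
    std::memcpy(&len, data + offset, sizeof(len));  // assumes little-endian host
    offset += sizeof(len);
    if (offset + static_cast<Py_ssize_t>(len) > data_len) {
      throw py::value_error("truncated BYTES tensor element");
    }
    // py::bytes copies the element once; the resulting Python objects are
    // owned by the interpreter, so normal reference counting applies.
    elements.append(py::bytes(data + offset, len));
    offset += len;
  }

  // Build the object array through numpy itself, mirroring
  // np.array(strs, dtype=np.object_) in the pure-Python implementation.
  py::module_ np = py::module_::import("numpy");
  return np.attr("array")(elements, py::arg("dtype") = np.attr("object_"));
}

PYBIND11_MODULE(bytes_tensor_ext, m)  // hypothetical module name
{
  m.def("deserialize_bytes_tensor", &DeserializeBytesTensor,
        "Sketch of a native BYTES tensor deserializer.");
}
```

Because every element is handed back to the interpreter as an ordinary bytes object, lifetime management stays with Python's reference counting; this is the garbage-collection behavior we refer to under "alternatives" below.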
Results from the micro-benchmarks for string tensors of various batch sizes are shown below; production recommendation systems typically run with batch sizes of a few thousand.
[10/18] Testing: batch_1000... PASS (C++: 104μs, Python: 1249μs, Speedup: 12.01x)
[11/18] Testing: batch_5000 (size: 5000)... PASS (C++: 356μs, Python: 5859μs, Speedup: 16.46x)
[12/18] Testing: batch_15000 (size: 15000)... PASS (C++: 681μs, Python: 17438μs, Speedup: 25.61x)
[13/18] Testing: batch_15000_repeated (size: 15000)... PASS (C++: 525μs, Python: 17219μs, Speedup: 32.80x)
[14/18] Testing: batch_15000_varying (size: 15000)... PASS (C++: 2265μs, Python: 30030μs, Speedup: 13.26x)
[15/18] Testing: batch_30000 (size: 30000)... PASS (C++: 1220μs, Python: 33637μs, Speedup: 27.57x)
[16/18] Testing: batch_10000_empty (size: 10000)... PASS (C++: 214μs, Python: 10534μs, Speedup: 49.22x)
[17/18] Testing: batch_5000_large (size: 5000)... PASS (C++: 980μs, Python: 9159μs, Speedup: 9.35x)
[18/18] Testing: batch_100_mixed... PASS (C++: 44μs, Python: 385μs, Speedup: 8.75x)
Describe alternatives you've considered
We initially explored optimizing the Python implementation but were unable to achieve comparable performance gains. While the C++ implementation adds some complexity, we believe the maintenance overhead is minimal, as the primary additional consideration is ensuring proper Python garbage collection behavior. Given the substantial performance benefits, we believe this trade-off is worthwhile.
Additional context
We would greatly appreciate your feedback on this optimization and guidance on integration with the Python backend repository. We're happy to adapt our implementation to meet your coding standards and testing requirements.
Our current implementation includes:
- The C++ optimization for deserialize_bytes_tensor
- Standalone tests verifying functional parity between Python and C++ implementations (test code)
- Micro-benchmarks demonstrating performance improvements (benchmark code)
Currently, these tests are built as separate binaries. We would appreciate your guidance on properly integrating them into the repository's CI/CD pipeline should you decide to move forward with this contribution.
Thank you for considering this optimization. We look forward to working with you to improve Triton's performance for the community.
Tagging folks who recently committed to the Python backend for awareness and feedback: @mattwittwer @yinggeh @mc-nv @pskiran1 @kthui @rmccorm4