Optimize proto verification

TocarIP · copybara-github · commit c7ab62b3315f · 2026-06-29T13:10:30.000-07:00
This implements the following optimizations:
1) Try to avoid a conditional move in PushLimit. The branch is well predicted,
and this allows us to shorten the critical path. ~1-2% improvement.
2) Split tags and verify functions into 2 separate tables. This saves space
(we avoid padding in the table), making it more cache efficient, and makes
tag search potentially vectorizable. ~1-5% improvement.
3) Fully unroll DiscardVarint. This makes things easier for the branch
predictor by splitting different branches, and allows the CPU to speculate
past the data dependency, since the next p is clearly known
(current + constant). ~4% speed-up.
4) Add a fast path for 1-byte tag + 1-byte varint.
5) Replace the switch on a rotated value with a switch + nested ifs. This helps
the branch predictor, especially with FDO, and also cuts down the critical path,
since the MSB calculation and the switch can be performed in parallel.
6) Restructure the loop by adding an inner loop that doesn't call functions.
This improves register allocation for the fast loop and doesn't affect
slow cases like messages.
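
As an illustration of item 3, a fully unrolled varint skip could look roughly
like the sketch below. This is not the code from this change: the name
SkipVarintUnrolled is hypothetical, and the sketch assumes the caller
guarantees at least 10 readable bytes at p (for example via slop bytes).

```cpp
#include <cstdint>

// Hypothetical helper, not the protobuf implementation.
// Skips one varint starting at p and returns the pointer just past it,
// or nullptr if no terminating byte is found within 10 bytes (malformed).
// Assumption: at least 10 bytes are readable starting at p.
inline const char* SkipVarintUnrolled(const char* p) {
  // A varint byte with the top bit clear is the last byte of the varint.
  // Every length gets its own branch, and every successful return is
  // p + constant, so the next pointer does not depend on a loop counter.
  if ((static_cast<uint8_t>(p[0]) & 0x80) == 0) return p + 1;
  if ((static_cast<uint8_t>(p[1]) & 0x80) == 0) return p + 2;
  if ((static_cast<uint8_t>(p[2]) & 0x80) == 0) return p + 3;
  if ((static_cast<uint8_t>(p[3]) & 0x80) == 0) return p + 4;
  if ((static_cast<uint8_t>(p[4]) & 0x80) == 0) return p + 5;
  if ((static_cast<uint8_t>(p[5]) & 0x80) == 0) return p + 6;
  if ((static_cast<uint8_t>(p[6]) & 0x80) == 0) return p + 7;
  if ((static_cast<uint8_t>(p[7]) & 0x80) == 0) return p + 8;
  if ((static_cast<uint8_t>(p[8]) & 0x80) == 0) return p + 9;
  if ((static_cast<uint8_t>(p[9]) & 0x80) == 0) return p + 10;
  return nullptr;  // More than 10 continuation bytes: not a valid varint.
}
```

Compared with a loop, each exit is a separate branch that the predictor can
track independently.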

Results:

AMD (Milan) is ~20% faster:
BM_V1VerifyViewAll/10   3.604µ ± 1%   2.882µ ± 0%  -20.04% (p=0.000 n=20)
BM_V1VerifyViewAll/100  3.741µ ± 1%   2.994µ ± 1%  -19.97% (p=0.000 n=20)
BM_V1VerifyViewAll/1000 3.798µ ± 1%   3.062µ ± 1%  -19.37% (p=0.000 n=20)
BM_V1VerifyCordAll/10   3.688µ ± 0%   2.963µ ± 0%  -19.65% (p=0.000 n=20)
BM_V1VerifyCordAll/100  3.837µ ± 1%   3.048µ ± 1%  -20.57% (p=0.000 n=20)
BM_V1VerifyCordAll/1000 3.894µ ± 0%   3.152µ ± 0%  -19.06% (p=0.000 n=20)
geomean                 3.759µ        3.016µ       -19.78%

Intel (Skylake) is slightly faster (~3-6%), but I think we are running out of CPU width?
BM_V1VerifyViewAll/10   5.002µ ± 1%   4.840µ ± 1%  -3.24% (p=0.006 n=20)
BM_V1VerifyViewAll/100  5.068µ ± 2%   4.912µ ± 3%  -3.09% (p=0.012 n=20)
BM_V1VerifyViewAll/1000 5.129µ ± 1%   4.954µ ± 1%  -3.40% (p=0.000 n=20)
BM_V1VerifyCordAll/10   5.105µ ± 2%   4.937µ ± 1%  -3.29% (p=0.004 n=20)
BM_V1VerifyCordAll/100  5.131µ ± 1%   4.999µ ± 5%  -2.57% (p=0.035 n=20)
BM_V1VerifyCordAll/1000 5.411µ ± 4%   5.079µ ± 3%  -6.13% (p=0.000 n=20)
geomean                 5.139µ        4.953µ       -3.63%

PiperOrigin-RevId: 939990208
diff --git a/src/google/protobuf/parse_context.h b/src/google/protobuf/parse_context.h
@@ -173,7 +173,11 @@ class PROTOBUF_EXPORT EpsCopyInputStream {
     // This add is safe due to the invariant above, because
     // ptr - buffer_end_ <= kSlopBytes.
     limit += static_cast<int>(ptr - buffer_end_);
-    limit_end_ = buffer_end_ + (std::min)(0, limit);
+    if (ABSL_PREDICT_TRUE(limit <= 0)) {
+      limit_end_ = buffer_end_ + limit;
+    } else {
+      limit_end_ = buffer_end_;
+    }
     auto old_limit = limit_;
     limit_ = limit;
     return LimitToken(old_limit - limit);
@@ -182,11 +186,16 @@ class PROTOBUF_EXPORT EpsCopyInputStream {
   [[nodiscard]] bool PopLimit(LimitToken delta) {
     // We must update the limit first before the early return. Otherwise, we can
     // end up with an invalid limit and it can lead to integer overflows.
-    limit_ = limit_ + std::move(delta).token();
+    int old_limit = limit_ + std::move(delta).token();
+    limit_ = old_limit;
     if (ABSL_PREDICT_FALSE(!EndedAtLimit())) return false;
     // TODO We could remove this line and hoist the code to
     // DoneFallback. Study the perf/bin-size effects.
-    limit_end_ = buffer_end_ + (std::min)(0, limit_);
+    if (ABSL_PREDICT_TRUE(old_limit <= 0)) {
+      limit_end_ = buffer_end_ + old_limit;
+    } else {
+      limit_end_ = buffer_end_;
+    }
     return true;
   }