Commit c7ab62b

TocarIP authored and copybara-github committed
Optimize proto verification
This implements the following optimizations:

1) Try to avoid a conditional move in PushLimit. The branch is well predicted, and this allows us to shorten the critical path. ~1-2% improvement.
2) Split tags and verify funcs into 2 separate tables. This saves space (we avoid padding in the table), making it more cache efficient, and makes the tag search potentially vectorizable. ~1-5% improvement.
3) Fully unroll DiscardVarint. This makes it easier for the branch predictor by splitting different branches, and allows the CPU to speculate past the data dependency, since we have a clear next p (current + constant). ~4% speed-up.
4) Add a fast path for 1-byte tag + 1-byte varint.
5) Replace the switch on a rotated value with a switch + nested ifs. This helps the branch predictor, especially with FDO, and also cuts down the critical path, since the msb calculation and the switch can be performed in parallel.
6) Restructure the loop by adding an inner loop that doesn't call functions. This improves register allocation for the fast loop and doesn't affect slow cases like messages.

Results:

AMD (milan) is 20% faster:

BM_V1VerifyViewAll/10     3.604µ ± 1%   2.882µ ± 0%   -20.04% (p=0.000 n=20)
BM_V1VerifyViewAll/100    3.741µ ± 1%   2.994µ ± 1%   -19.97% (p=0.000 n=20)
BM_V1VerifyViewAll/1000   3.798µ ± 1%   3.062µ ± 1%   -19.37% (p=0.000 n=20)
BM_V1VerifyCordAll/10     3.688µ ± 0%   2.963µ ± 0%   -19.65% (p=0.000 n=20)
BM_V1VerifyCordAll/100    3.837µ ± 1%   3.048µ ± 1%   -20.57% (p=0.000 n=20)
BM_V1VerifyCordAll/1000   3.894µ ± 0%   3.152µ ± 0%   -19.06% (p=0.000 n=20)
geomean                   3.759µ        3.016µ        -19.78%

Intel (skylake) is slightly faster, but I think we are running out of cpu width?

BM_V1VerifyViewAll/10     5.002µ ± 1%   4.840µ ± 1%   -3.24% (p=0.006 n=20)
BM_V1VerifyViewAll/100    5.068µ ± 2%   4.912µ ± 3%   -3.09% (p=0.012 n=20)
BM_V1VerifyViewAll/1000   5.129µ ± 1%   4.954µ ± 1%   -3.40% (p=0.000 n=20)
BM_V1VerifyCordAll/10     5.105µ ± 2%   4.937µ ± 1%   -3.29% (p=0.004 n=20)
BM_V1VerifyCordAll/100    5.131µ ± 1%   4.999µ ± 5%   -2.57% (p=0.035 n=20)
BM_V1VerifyCordAll/1000   5.411µ ± 4%   5.079µ ± 3%   -6.13% (p=0.000 n=20)
geomean                   5.139µ        4.953µ        -3.63%

PiperOrigin-RevId: 939990208
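Items 2 and 3 of the commit message refer to code that is not part of the diff on this page. The sketch below is one way those two ideas could look, not the actual verifier code. All names (`VerifyFn`, `InterleavedEntry`, `SplitTable`, `FindTagIndex`, `SkipVarintUnrolled`) are hypothetical, and the padding remark assumes a typical 64-bit target where a function pointer is 8 bytes with 8-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical verifier signature, used only for this sketch.
using VerifyFn = bool (*)(const char* data, std::size_t size);

// Interleaved layout: on a typical 64-bit target the 4-byte tag is followed
// by padding so that the function pointer stays aligned.
struct InterleavedEntry {
  std::uint32_t tag;
  VerifyFn verify;
};

// Split layout: tags are packed contiguously, and the verify function for
// tags[i] lives at verify[i] in a parallel array.
struct SplitTable {
  const std::uint32_t* tags;
  const VerifyFn* verify;
  int size;
};

// Linear scan over the packed tag array. Returns the index of the matching
// tag, or -1 if the tag is not in the table.
inline int FindTagIndex(const SplitTable& table, std::uint32_t tag) {
  assert(table.tags != nullptr);
  for (int i = 0; i < table.size; ++i) {
    if (table.tags[i] == tag) return i;
  }
  return -1;
}

// Skips one varint starting at p with the loop fully unrolled. Every exit
// returns p plus a constant, and byte k is read only if bytes 0..k-1 all had
// the continuation bit set. Returns nullptr if the 10th byte still has the
// continuation bit set (a varint is at most 10 bytes long).
inline const char* SkipVarintUnrolled(const char* p) {
  auto has_more = [](const char* q) {
    return (static_cast<std::uint8_t>(*q) & 0x80) != 0;
  };
  if (!has_more(p + 0)) return p + 1;
  if (!has_more(p + 1)) return p + 2;
  if (!has_more(p + 2)) return p + 3;
  if (!has_more(p + 3)) return p + 4;
  if (!has_more(p + 4)) return p + 5;
  if (!has_more(p + 5)) return p + 6;
  if (!has_more(p + 6)) return p + 7;
  if (!has_more(p + 7)) return p + 8;
  if (!has_more(p + 8)) return p + 9;
  if (!has_more(p + 9)) return p + 10;
  return nullptr;
}
```

With the split layout, the search touches only the tag array, and the function pointer is loaded once after a match. Whether a compiler actually vectorizes the scan depends on the target and flags.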
1 parent e8c74e1 commit c7ab62b

1 file changed

Lines changed: 12 additions & 3 deletions

src/google/protobuf/parse_context.h

@@ -173,7 +173,11 @@ class PROTOBUF_EXPORT EpsCopyInputStream {
     // This add is safe due to the invariant above, because
     // ptr - buffer_end_ <= kSlopBytes.
     limit += static_cast<int>(ptr - buffer_end_);
-    limit_end_ = buffer_end_ + (std::min)(0, limit);
+    if (ABSL_PREDICT_TRUE(limit <= 0)) {
+      limit_end_ = buffer_end_ + limit;
+    } else {
+      limit_end_ = buffer_end_;
+    }
     auto old_limit = limit_;
     limit_ = limit;
     return LimitToken(old_limit - limit);
@@ -182,11 +186,16 @@ class PROTOBUF_EXPORT EpsCopyInputStream {
   [[nodiscard]] bool PopLimit(LimitToken delta) {
     // We must update the limit first before the early return. Otherwise, we can
     // end up with an invalid limit and it can lead to integer overflows.
-    limit_ = limit_ + std::move(delta).token();
+    int old_limit = limit_ + std::move(delta).token();
+    limit_ = old_limit;
     if (ABSL_PREDICT_FALSE(!EndedAtLimit())) return false;
     // TODO We could remove this line and hoist the code to
     // DoneFallback. Study the perf/bin-size effects.
-    limit_end_ = buffer_end_ + (std::min)(0, limit_);
+    if (ABSL_PREDICT_TRUE(old_limit <= 0)) {
+      limit_end_ = buffer_end_ + old_limit;
+    } else {
+      limit_end_ = buffer_end_;
+    }
     return true;
   }
