-
Notifications
You must be signed in to change notification settings - Fork 36
Implement Small-Value Sum-Check Optimization (Algorithm 6) #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
wu-s-john
wants to merge
58
commits into
microsoft:main
Choose a base branch
from
wu-s-john:feat/procedure-9-accumulator
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in src/lagrange.rs for representing evaluation domains U_d and Û_d used in the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with
from_multilinear() factory method that extends evaluations from {0,1}^n
to U_d^n.
sumcheck optimization (Algorithm 6) Introduces RoundAccumulator and SmallValueAccumulators for the small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage with const generic D for cache efficiency and vectorizable merge operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and LagrangeEvaluatedMultilinearPolynomial with const generic D to enable: - Compile-time enforcement that domain types match accumulator degree - Debug assertions for bounds checking (v < D in constructors) - Elimination of runtime base parameter from to_flat_index() This prevents mixing domain sizes at compile time and catches out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
Extracts strided polynomial evaluations for all binary prefixes b ∈
{0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6
(Lagrange extension).
Added a parallel build_accumulators that binds suffixes, extends prefixes to the Ud domain, applies the ∞/Cz rule, and routes contributions via cached idx4 with E_in/E_out weighting. Expanded accumulator tests with a naive cross-check, ∞ handling, and binary-β zero behavior to validate correctness. Cleaned up dead-code allowances now that the code paths are used.
Added explicit MSB-first checks for eq table generation, gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure “top” binds the MSB.These tests catch silent index/order regressions across components.
Compute ℓ_i(X) = eqe(w[<i], r[<i]) · eqe(w_i, X) values for sum-check rounds. Compute ℓ_i(0)=α_i(1−w_i), ℓ_i(1)=α_i w_i, ℓ_i(∞)=α_i(2w_i−1) for sum-check rounds
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with EqSumCheckInstance rounds.
indexing Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2 allocations instead of N+1, improving cache locality. Replaces ad-hoc offsets/entries arrays in build_accumulators
- Add prove_cubic_with_three_inputs_small_value combining small-value optimization for first ℓ₀ rounds with eq-poly optimization for remaining - Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree parameter - Add sumcheck_sweep.rs examples for performance comparison
build_accumulators The new from_boolean_evals_with_buffer_reusing method takes caller-provided scratch buffers and alternates between them during extension. This reduces allocations from O(num_x_in × num_x_out) per call to O(num_threads) buffers allocated once per thread.
variants
Spartan version (D=2) skips binary betas since satisfying witnesses have
Az·Bz = Cz on {0,1}^n. Generic version supports arbitrary polynomial
products.
Adds a new example that tests prove_cubic_with_three_inputs and prove_cubic_with_three_inputs_small_value produce identical proofs when used with a real SHA256 circuit (Algorithm 6 validation). Changes: - Add PartialEq, Eq derive to SumcheckProof for proof comparison - Add extract_outer_sumcheck_inputs helper to SpartanSNARK - Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.
Key changes:
- Add SmallValueField trait for type-safe i32/i64 small-value
operations
- Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
- Add SpartanAccumulatorInput trait to unify field and i32 witness
handling
- Make LagrangeEvaluatedMultilinearPolynomial generic over element
type
- Update sumcheck prover to accept separate i32 witness polynomials
- Clean up MultilinearPolynomial<i32>: remove unused
from_u32/from_u64/from_field
2828f04 to
67674c4
Compare
evaluations
Replace raw arrays and ad-hoc structs with proper abstractions for U_d =
{∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove
EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and UdHatPoint - Move test-only methods (alpha, prefix_len, suffix_len, extend_from_boolean) to cfg(test) impl blocks - Add CachedPrefixIndex struct with From impl to accumulator_index.rs - Remove unused QuadraticTAccumulatorPrefixIndex type alias - Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in build_accumulators_spartan and build_accumulators. Previously, 5 vectors were allocated on every x_out iteration; now allocations happen once per Rayon thread subdivision. - Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids .to_vec()) - Add SpartanThreadState and GenericThreadState structs for buffer reuse - Extract thread state structs to thread_state_accumulators module Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to its own module for better code organization. Rename from SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify that it abstracts over multilinear polynomial representations (field elements vs small values).
Replace per-iteration modular reductions with accumulated wide-integer arithmetic, reducing once per beta instead of once per x_in iteration. Key changes: - Add WideLimbs<N> for wide unsigned integer arithmetic (6/8 limbs) - Refactor SmallValueField to be generic over small value type (i32/i64) - Add UnreducedMontInt types for delayed reduction in Montgomery form - Replace SpartanAccumulatorInputPolynomial with MatVecMLE trait - Optimize eq polynomial table computation (1 mul instead of 2 per element) - Update benchmark to compare i32/i64 vs i64/i128 variants
- Add mac() helper for fused multiply-accumulate, eliminating temporary arrays in unreduced_mont_int_mul_add (4 implementations) - Subtract in limb space before reduction via sub_mag(), saving one Barrett reduction per signed accumulator - Replace large e_out tables with JIT-computed eyx scratch buffers, reducing eq table memory 7× and improving cache locality - Add unreduced_is_zero() fast path to skip expensive modular reduction - Precompute betas_with_infty indices to avoid filter in inner loop - Use barrett_reduce_6_* directly for i128 products instead of padding to 8 limbs (saves 8 wasted multiplications per isl_mul call)
propagation Replace mac(acc, 0, 0, carry) calls with simple overflowing_add to avoid unnecessary u128 multiply-add pipeline for pure carry propagation. Also add #[inline(always)] to hot path functions to ensure full inlining.
- Apply rustfmt formatting fixes in accumulators.rs - Fix clippy manual_is_multiple_of warning in test code
Introduce circuit gadgets optimized for small-value sumcheck optimization: - SmallMultiEq: Batches equality constraints with bounded coefficients, flushing at MAX_COEFF_BITS (31) instead of bellpepper's ~237. This keeps constraint coefficients within i32 bounds for the small-value optimization. - SmallUInt32: 32-bit unsigned integer gadget using SmallMultiEq for carry constraints in addmany operations. - small_sha256: SHA-256 implementation using the above gadgets, producing circuits where Az and Bz values fit in i32. - Update sumcheck_sha256_equivalence example to use bellpepper's Circuit trait for constraint counting, comparing SmallSha256 vs bellpepper SHA-256. The tradeoff: SmallSha256 generates ~17% more R1CS constraints due to more frequent MultiEq flushing, but enables the small-value sumcheck optimization. Add 16-bit limbed addition for i32 small-value optimization SmallUInt32::addmany produces coefficients up to 2^34, exceeding i32 bounds. Splitting into 16-bit limbs reduces max coefficient to 2^18, enabling i32/i64 small-value sumcheck for SHA-256. - Add SmallValueConfig trait with Small32 (i32/i64) and Small64 (i64/i128) - Implement addmany_limbed using two constraints per addition - Update SmallMultiEq to be generic over config - Fix example to use config-specific bounds check
- Add examples/sha256_chain_benchmark.rs comparing original vs small-value sumcheck performance on SHA-256 hash chains - CSV output includes witness synthesis time, sumcheck times, speedup, and witness percentage of total proving time - CLI support: single <num_vars> for profiling, range-sweep for benchmarks - Add small_sha256_with_prefix() for chaining multiple SHA-256 hashes with unique constraint namespaces - Fix SmallValueField<i64> generic in lagrange.rs - Fix unused variable warning in msm.rs
Split SmallValueField into two traits for better separation of concerns: - SmallValueField: core small-value operations (ss_mul, sl_mul, isl_mul) - DelayedReduction: unreduced accumulator operations for hot paths Rename types for clarity: - UnreducedMontInt → UnreducedFieldInt (field × integer products) - UnreducedMontMont → UnreducedFieldField (field × field products) Add FieldReductionConstants trait to deduplicate Barrett/Montgomery reduction: - Consolidates Fp/Fq constants (MODULUS, R256-R512, MONT_INV) - Generic reduction functions monomorphized at compile time for zero overhead - Comprehensive documentation explaining R constants (2^k mod p) Performance and cleanup: - Add ext_buf_idx scratch buffer to avoid Vec allocation in accumulator hot loop - Remove unused OrderedVariable from shape_cs modules (~140 lines) - Remove unused build_univariate_round_evals from sumcheck (~40 lines) - Add log2_constraints column to benchmark CSV output
Split the 2,367-line small_field.rs into a proper module structure: - small_field/small_value_field.rs: SmallValueField trait - small_field/delayed_reduction.rs: DelayedReduction trait - small_field/barrett.rs: Barrett/Montgomery reduction functions - small_field/impls.rs: Fp/Fq implementations and tests - small_field/mod.rs: re-exports and helper functions Moved batching configuration types (NoBatching, Batching<K>, BatchingMode, SmallMultiEqConfig, I32NoBatch, I64Batch21) from small_field to gadgets/small_multi_eq.rs where they logically belong, since they're specifically for constraint batching in SmallMultiEq. Added detailed documentation for I64Batch21 explaining why K=21 is the safe maximum: with SHA-256-like circuits having ~200 terms and 2^34 positional coefficients, batching 21 constraints keeps the worst-case magnitude (2^62) under the i64 signed limit (2^63).
contributions Refactors shared logic between Spartan and generic accumulator builders.
878e7b0 to
406b59e
Compare
Improves type safety and self-documentation by replacing (bool, [u64; N]) with an explicit enum indicating whether the result is positive (a >= b) or negative (a < b).
Move wide_limbs.rs content and limb arithmetic from barrett.rs into a unified small_field/limbs.rs module for delayed modular reduction.
Split monolithic lagrange.rs (1667 lines) into focused submodules:
- domain.rs: LagrangePoint, LagrangeHatPoint, LagrangeIndex
- evals.rs: LagrangeEvals, LagrangeHatEvals
- basis.rs: LagrangeBasisFactory, LagrangeCoeff
- extension.rs: LagrangeEvaluatedMultilinearPolynomial
- accumulator.rs: RoundAccumulator, LagrangeAccumulators
- accumulator_builder.rs: build_accumulators_spartan,
build_accumulators
Consolidate related files into the module:
- accumulator_index.rs → index.rs
- thread_state_accumulators.rs → thread_state.rs
- eq_linear.rs → eq_round.rs
Simplify extend_in_place API: use std::mem::swap to ensure result is
always in first buffer, eliminating conditional buffer selection at
call sites. Rename buf_a/b to buf_curr/scratch for clarity.
- Refactor SmallMultiEq from struct to trait with NoBatchEq and
BatchingEq<K> implementations
- Add addmany module with limbed (i32) and full (i64) addition
algorithms
- Deduplicate SHA-256 circuits into examples/circuits/sha256/ module
- Update small_uint32 and small_sha256 to use SmallMultiEq trait
phase - Extend MatVecMLE trait with UnreducedFieldField type for F×F accumulation - Add unreduced bucket accumulators to SpartanThreadState - Replace eyx precomputation with direct e_y access and z_beta = ex * tA_red - Keep unreduced across all x_out iterations and merge without reduction - Pre-compute beta values to eliminate closure overhead in scatter loop - Final Montgomery reduction only once per bucket after thread merge This reduces Montgomery reductions from ~7000+ per x_out to ~26 total for typical parameters (l0=3, 128 x_outs).
savings Replace asymmetric l/2 split with balanced ceil/floor split. This reduces precomputation cost (e.g., 36→24 for l=10, l0=3), enables odd number of rounds, and improves cache utilization by making e_xout smaller.
between runs Wraps each benchmark in its own scope block so large polynomial vectors are dropped before the next benchmark starts. Reduces peak memory from ~78GB to ~26GB for num_vars=28. Also removes the unnecessary even num_vars constraint from Algorithm 6.
Implement Spartan Engine trait for BN254 (alt_bn128) including: - Barrett reduction constants and field operations for BN254 Fr - SmallValueField<i64> and DelayedReduction<i64> implementations - Bn254Engine with Hyrax PCS and Keccak256 transcript
- Add --field flag to select Pallas, Vesta, or BN254 curves - Add --methods flag to choose benchmarks (base, i32, i64) - Add --trials flag for multiple runs per num_vars - Separate setup and prove timings in CSV output - Make benchmarks generic over Engine type
Move beta_values Vec from per-iteration allocation to thread-local buffer in SpartanThreadState. Reduces allocations from O(num_x_out) to O(num_threads) in the scatter phase.
stack array - Use Vec::with_capacity(num_rounds) for r and polys vectors - Replace heap-allocated vec![...] with stack array for per-round evals
Add bind_three_polys_top helper that binds three polynomials together, reducing Rayon dispatches from 2 to 0-1 per round and using serial fallback for small polynomials (n < 4096) to avoid scheduling overhead.
Uses two-phase wide-limb accumulation to reduce Montgomery reductions
from O(2^k) to O(2^{k/2}) per round in
prove_cubic_with_three_inputs_small_value. Exploits split-eq
factorization E[id] = E_out[x_out] * E_in[x_in].
Extract extend_single and extend_batch4 helpers to process 4 suffix elements together, enabling instruction-level parallelism on AArch64.
Adds prove_cubic_with_three_inputs_split_eq_delayed to measure the effect of delayed modular reduction in eq polynomial evaluation separately from small-value accumulator precomputation.
Introduces DelayedModularReductionMode trait with
DelayedModularReductionEnabled and DelayedModularReductionDisabled
marker types for zero-cost compile-time strategy selection. This enables
benchmarking DMR speedup without runtime branching overhead.
Key changes:
- Add delay_modular_reduction_mode.rs with AccumulateProduct and Mode
traits
- Make RoundAccumulator/LagrangeAccumulators generic over element type
- Simplify MatVecMLE: move accumulation logic to Mode trait
- Add accum_bench example for DMR comparison benchmarks
- Make lagrange_sumcheck module public for external benchmarking
Author
|
@microsoft-github-policy-service agree |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Implement Small-Value Sum-Check Optimization (Algorithm 6)
Summary
This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.
Key Insight
In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.
Multiplication Cost Hierarchy:
For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
Benchmarks
Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with
jemalloc.Note:
halo2curves/asmis not enabled (unavailable on Apple Silicon).Headline Result: 1.83× Speedup on BN254
At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):
Scaling Across Problem Sizes
Key observations:
Delayed Modular Reduction Impact
To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.
Accumulator Building Phase
The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.
Key observations:
Time Breakdown: First l0 Rounds vs Remaining Rounds
With l0=3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:
Key observations:
Split-Eq Sumcheck with DMR
For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.
Key observations:
SHA-256 Chain Benchmark
To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.
Key observations:
Solana Light Client Comparison
A Solana light client verifying block finality requires:
SHA-256 equivalent cost:
Implementation
Core Components
SmallValueFieldtrait (src/small_field.rs)SmallValue(i32) andIntermediateSmallValue(i64) typessl_mulandisl_mulfor BN254/BLS12-381 (~3× faster than ll)Lagrange Domain Extension (
src/lagrange.rs)LagrangeEvaluatedMultilinearPolynomial<T, D>for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}extend_in_placewith ping-pong buffersgather_prefix_evalsfor efficient prefix collection (Procedure 6)Accumulator Data Structures (
src/accumulators.rs,src/accumulator_index.rs)SmallValueAccumulators<S, D>storing A_i(v, u) with O(1) indexing viaUdTupleidx4mapping (Definition A.5) for distributing products to correct accumulatorsUdEvaluationsandUdHatEvaluationswrappersProcedure 9 Implementation (
src/accumulators.rs)build_accumulators_spartan: Optimized for Spartan's Az·Bz structurebuild_accumulators: Generic version for arbitrary polynomial productsThread-Local Buffer Reuse (
src/thread_state_accumulators.rs)SpartanThreadStateandGenericThreadStateeliminate O(num_x_out) allocationsSum-Check Integration (
src/sumcheck.rs)SmallValueSumCheck::from_accumulatorsfactory methodAlgorithm Flow
Test Plan
cargo test test_build_accumulators- Verifies accumulator constructioncargo test test_small_value- SmallValueField arithmetic correctnesscargo test lagrange- Lagrange extension and interpolationcargo test sumcheck- Full sum-check protocol equivalencecargo clippy- No warningsexamples/sumcheck_sha256_equivalence.rs- Verifies new method produces identical proofs to baselineexamples/sha256_chain_benchmark.rs- SHA-256 chain proving with CSV outputReferences