@wu-s-john commented on Dec 18, 2025

Implement Small-Value Sum-Check Optimization (Algorithm 6)

Summary

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.

Key Insight

In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.

Multiplication Cost Hierarchy:

  • ss (small × small): Native i32/i64 multiplication (~1 cycle)
  • sl (small × large): Barrett-optimized multiplication (~9 base mults)
  • ll (large × large): Full Montgomery multiplication (~32 base mults)

For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
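To make the trade-off concrete, here is a small illustrative sketch (not code from this PR) that plugs ℓ₀ into the operation counts above; the constants are the rough per-multiplication costs quoted in the hierarchy, not measured numbers.

```rust
// Illustrative only: asymptotic multiplication counts for Algorithm 6 at a
// given ℓ₀, using the rough costs quoted above (ll ≈ 32 base mults, ss ≈ 1).
fn algorithm6_mult_counts(n: u64, l0: u32) -> (u64, u64) {
    let ll_mults = n >> l0;                                      // O(N / 2^ℓ₀)
    let ss_mults = (n as f64 * 1.5_f64.powi(l0 as i32)) as u64;  // O((3/2)^ℓ₀ · N)
    (ll_mults, ss_mults)
}

fn main() {
    let n = 1u64 << 26; // 2^26 constraints, as in the headline benchmark
    for l0 in 1..=5 {
        let (ll, ss) = algorithm6_mult_counts(n, l0);
        // Larger ℓ₀ trades expensive field multiplications for cheap native
        // ones, until the (3/2)^ℓ₀ growth in small multiplications dominates.
        println!("l0 = {l0}: ~{ll} ll mults, ~{ss} ss mults");
    }
}
```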

Benchmarks

Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

Headline Result: 1.83× Speedup on BN254

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):

sumcheck_bench_bn254-fr_n26_l03
| Percentile   | Base prove (ms) | Small-value prove (ms) | Speedup |
|--------------|-----------------|------------------------|---------|
| 25%          | 2,588           | 1,418                  | 1.79×   |
| 50% (median) | 2,616           | 1,429                  | 1.82×   |
| 75%          | 2,662           | 1,473                  | 1.84×   |
| 90%          | 2,895           | 1,577                  | 1.88×   |
  • Mean speedup: 1.83× (base: 2,671ms → small-value: 1,464ms)
  • Lower variance in optimized version (std: 73ms vs 151ms)
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

`cargo run --release --example sumcheck_sweep -- --field bn254-fr range-sweep --min 16 --max 27`

| num_vars | n           | base prove (µs) | small-value prove (µs) | speedup |
|----------|-------------|-----------------|------------------------|---------|
| 16       | 65,536      | 9,459           | 4,619                  | 2.05×   |
| 17       | 131,072     | 8,795           | 4,516                  | 1.95×   |
| 18       | 262,144     | 13,582          | 7,567                  | 1.80×   |
| 19       | 524,288     | 24,293          | 14,199                 | 1.71×   |
| 20       | 1,048,576   | 49,350          | 25,117                 | 1.97×   |
| 21       | 2,097,152   | 96,088          | 62,382                 | 1.54×   |
| 22       | 4,194,304   | 176,852         | 94,437                 | 1.87×   |
| 23       | 8,388,608   | 349,647         | 190,513                | 1.84×   |
| 24       | 16,777,216  | 679,342         | 365,720                | 1.86×   |
| 25       | 33,554,432  | 1,409,206       | 729,882                | 1.93×   |
| 26       | 67,108,864  | 2,799,654       | 1,493,126              | 1.88×   |
| 27       | 134,217,728 | 5,671,207       | 3,066,167              | 1.85×   |

Key observations:

  • Speedups range from 1.54× to 2.05× on BN254, with most sizes landing in the 1.8-1.9× range
  • Speedup remains stable even at n = 2²⁷ (134M constraints)
  • Peak speedup of 2.05× at n = 2¹⁶

Delayed Modular Reduction Impact

To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.

Accumulator Building Phase

The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.

`cargo run --release --example accum_bench -- --field bn254-fr --l0 3 range-sweep --min 16 --max 27`

| num_vars | n           | DMR accum (µs) | no-DMR accum (µs) | accum speedup |
|----------|-------------|----------------|-------------------|---------------|
| 16       | 65,536      | 1,058          | 3,983             | 3.76×         |
| 17       | 131,072     | 1,595          | 8,668             | 5.43×         |
| 18       | 262,144     | 2,434          | 18,357            | 7.54×         |
| 19       | 524,288     | 3,831          | 15,732            | 4.11×         |
| 20       | 1,048,576   | 7,708          | 29,625            | 3.84×         |
| 21       | 2,097,152   | 13,739         | 55,788            | 4.06×         |
| 22       | 4,194,304   | 25,401         | 120,680           | 4.75×         |
| 23       | 8,388,608   | 55,910         | 223,901           | 4.00×         |
| 24       | 16,777,216  | 116,574        | 458,829           | 3.94×         |
| 25       | 33,554,432  | 242,699        | 935,360           | 3.85×         |
| 26       | 67,108,864  | 516,107        | 1,891,151         | 3.66×         |
| 27       | 134,217,728 | 1,024,774      | 3,794,770         | 3.70×         |

Key observations:

  • DMR provides 3.7-7.5× speedup on accumulator building with ℓ₀ = 3
  • Peak speedup of 7.54× at n = 2¹⁸
  • This is the primary source of performance gains in the small-value optimization
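The idea behind DMR, stripped of the repo's WideLimbs/Barrett machinery, is to accumulate unreduced products and reduce once per bucket. A minimal toy sketch (61-bit Mersenne prime, u128 accumulator; not the actual implementation):

```rust
// Toy illustration of delayed modular reduction, assuming a 61-bit Mersenne
// prime so the u128 accumulator below cannot overflow. The real code uses
// multi-limb (WideLimbs) accumulators with Barrett/Montgomery reduction; this
// only shows the "reduce once per bucket" idea.
const P: u128 = (1u128 << 61) - 1;

fn eager(values: &[(u64, u64)]) -> u64 {
    let mut acc: u128 = 0;
    for &(a, b) in values {
        acc = (acc + (a as u128) * (b as u128)) % P; // one reduction per term
    }
    acc as u64
}

fn delayed(values: &[(u64, u64)]) -> u64 {
    let mut acc: u128 = 0;
    for &(a, b) in values {
        acc += (a as u128) * (b as u128); // accumulate unreduced
    }
    (acc % P) as u64 // single reduction at the end of the bucket
}

fn main() {
    let values: Vec<(u64, u64)> = (0..1_000u64).map(|i| (i + 1, 2 * i + 3)).collect();
    assert_eq!(eager(&values), delayed(&values));
    println!("same result, ~1000× fewer reductions in the delayed version");
}
```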

Time Breakdown: First ℓ₀ Rounds vs Remaining Rounds

With ℓ₀ = 3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:

| num_vars | n           | first ℓ₀ (ms) | remaining (ms) | ratio (first : remaining) |
|----------|-------------|---------------|----------------|---------------------------|
| 16       | 65,536      | 1.2           | 3.4            | 0.34 : 1                  |
| 17       | 131,072     | 1.7           | 2.8            | 0.60 : 1                  |
| 18       | 262,144     | 2.5           | 5.0            | 0.50 : 1                  |
| 19       | 524,288     | 3.9           | 10.3           | 0.38 : 1                  |
| 20       | 1,048,576   | 7.8           | 17.3           | 0.45 : 1                  |
| 21       | 2,097,152   | 13.8          | 48.6           | 0.28 : 1                  |
| 22       | 4,194,304   | 25.5          | 68.9           | 0.37 : 1                  |
| 23       | 8,388,608   | 56.0          | 134.5          | 0.42 : 1                  |
| 24       | 16,777,216  | 116.6         | 249.1          | 0.47 : 1                  |
| 25       | 33,554,432  | 242.8         | 487.1          | 0.50 : 1                  |
| 26       | 67,108,864  | 516.2         | 977.0          | 0.53 : 1                  |
| 27       | 134,217,728 | 1,024.8       | 2,041.3        | 0.50 : 1                  |

Key observations:

  • First ℓ₀ rounds (accumulator build + the ℓ₀ round proofs) take roughly 1/3 to 1/2 of total prove time
  • The remaining rounds (ℓ₀+1 through ℓ) dominate, taking roughly 2× as long as the first ℓ₀
  • The ratio stabilizes around 1 : 2 for large instances (n ≥ 2²⁴)
  • This balanced split indicates ℓ₀ = 3 is a good choice for BN254

Split-Eq Sumcheck with DMR

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.

`cargo run --release --example sumcheck_sweep -- --methods base,split-eq-dmr range-sweep --min 16 --max 27`

| num_vars | n           | base prove (µs) | split-eq-DMR prove (µs) | prove speedup |
|----------|-------------|-----------------|-------------------------|---------------|
| 16       | 65,536      | 9,546           | 5,113                   | 1.87×         |
| 17       | 131,072     | 8,278           | 5,698                   | 1.45×         |
| 18       | 262,144     | 13,570          | 9,973                   | 1.36×         |
| 19       | 524,288     | 26,253          | 18,446                  | 1.42×         |
| 20       | 1,048,576   | 44,574          | 41,793                  | 1.07×         |
| 21       | 2,097,152   | 95,211          | 86,503                  | 1.10×         |
| 22       | 4,194,304   | 172,328         | 136,639                 | 1.26×         |
| 23       | 8,388,608   | 327,351         | 271,947                 | 1.20×         |
| 24       | 16,777,216  | 630,354         | 525,370                 | 1.20×         |
| 25       | 33,554,432  | 1,255,856       | 1,058,761               | 1.19×         |
| 26       | 67,108,864  | 2,506,240       | 2,171,037               | 1.15×         |
| 27       | 134,217,728 | 4,904,540       | 5,304,525               | 0.93×         |

Key observations:

  • Split-eq with DMR yields a 1.07-1.87× speedup at every size up to n = 2²⁶
  • At n = 2²⁷, there is a slight slowdown (0.93×), likely due to increased memory pressure from DMR state
  • The speedup peaks at 1.87× for the smallest instance (n = 2¹⁶)
  • For large instances (n = 2²⁵-2²⁶), the speedup settles around 1.15-1.19×

SHA-256 Chain Benchmark

To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.

`cargo run --release --no-default-features --example sha256_chain_benchmark`

| chain_length | num_vars | log₂(constraints) | num_constraints | witness (ms) | orig sumcheck (ms) | small sumcheck (ms) | total (ms) | speedup | witness % |
|--------------|----------|-------------------|-----------------|--------------|--------------------|---------------------|------------|---------|-----------|
| 2            | 16       | 16                | 65,536          | 14           | 5                  | 3                   | 20         | 1.67×   | 70.0%     |
| 8            | 18       | 18                | 262,144         | 55           | 16                 | 11                  | 75         | 1.45×   | 73.3%     |
| 32           | 20       | 20                | 1,048,576       | 229          | 48                 | 32                  | 301        | 1.50×   | 76.1%     |
| 128          | 22       | 22                | 4,194,304       | 1,260        | 163                | 109                 | 1,547      | 1.50×   | 81.4%     |
| 512          | 24       | 24                | 16,777,216      | 5,686        | 609                | 395                 | 6,743      | 1.54×   | 84.3%     |
| 2048         | 26       | 26                | 67,108,864      | 17,015       | 2,857              | 1,677               | 22,116     | 1.70×   | 76.9%     |

Key observations:

  • 2048 SHA-256 hashes proven in ~22 seconds
  • Witness generation dominates at 70-84% of total proving time
  • Small-value sumcheck achieves consistent 1.45-1.70× speedup

Solana Light Client Comparison

A Solana light client verifying block finality requires:

| Component                   | Hash function              | Count          |
|-----------------------------|----------------------------|----------------|
| Vote signature verification | SHA-512 (Ed25519 internal) | ~21 to ~1,588  |
| Merkle shred verification   | SHA-256                    | ~108 to ~1,206 |

  • Ed25519 uses SHA-512 internally for challenge hashing
  • Finality requires ≥2/3 supermajority stake (~21-530 validators)
  • SHA-512 is ~1.5-2× more expensive than SHA-256 per hash

SHA-256 equivalent cost:

  • Solana SHA-256: ~1,206 hashes
  • Solana SHA-512: ~1,588 × 1.5-2 = ~2,382-3,176 SHA-256 equivalent
  • Total: ~3,588-4,382 SHA-256 equivalent
  • Our 2048-chain benchmark covers ~47-57% of Solana's worst-case proving requirement

Implementation

Core Components

  1. SmallValueField trait (src/small_field.rs) — see the sketch after this list

    • Defines SmallValue (i32) and IntermediateSmallValue (i64) types
    • Barrett-optimized sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)
    • Overflow analysis ensuring correctness for typical witness bounds
  2. Lagrange Domain Extension (src/lagrange.rs)

    • LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
    • Zero-allocation extend_in_place with ping-pong buffers
    • gather_prefix_evals for efficient prefix collection (Procedure 6)
  3. Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)

    • SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
    • idx4 mapping (Definition A.5) for distributing products to correct accumulators
    • Type-safe UdEvaluations and UdHatEvaluations wrappers
  4. Procedure 9 Implementation (src/accumulators.rs)

    • build_accumulators_spartan: Optimized for Spartan's Az·Bz structure
    • build_accumulators: Generic version for arbitrary polynomial products
    • Parallel fold-reduce with thread-local scratch buffers
  5. Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)

    • SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations
    • Reduces allocator contention in parallel workloads
  6. Sum-Check Integration (src/sumcheck.rs)

    • SmallValueSumCheck::from_accumulators factory method
    • Round-by-round Lagrange coefficient multiplication (R_{i+1} = R_i ⊗ L_{U_d}(r_i))
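Following up on item 1 above, here is a rough sketch of the shape of the SmallValueField trait; the method names ss_mul, sl_mul, and isl_mul come from the PR description, but the exact signatures and bounds in src/small_field.rs may differ (the real code also splits out a separate DelayedReduction trait).

```rust
// Sketch only — not the repo's exact trait definition.
use ff::PrimeField;

/// Field with cheap mixed small/large multiplications.
/// `SmallValue` holds original witness evaluations (fits in i32);
/// `IntermediateSmallValue` holds products of small values (fits in i64).
pub trait SmallValueFieldSketch: PrimeField {
    type SmallValue: Copy;               // e.g. i32
    type IntermediateSmallValue: Copy;   // e.g. i64

    /// small × small → intermediate: native integer multiply (~1 cycle).
    fn ss_mul(a: Self::SmallValue, b: Self::SmallValue) -> Self::IntermediateSmallValue;

    /// small × large → field: Barrett-style reduction (~9 base mults).
    fn sl_mul(small: Self::SmallValue, large: &Self) -> Self;

    /// intermediate-small × large → field: same idea for i64 inputs.
    fn isl_mul(small: Self::IntermediateSmallValue, large: &Self) -> Self;
}
```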

Algorithm Flow

┌─────────────────────────────────────────────────────────────────────────┐
│  Precomputation: Build accumulators A_i(v, u) for i ∈ [ℓ₀]              │
│                                                                         │
│  For each x_out ∈ {0,1}^{ℓ/2-ℓ₀}:                                       │
│    For each x_in ∈ {0,1}^{ℓ/2}:                                         │
│      ein = eq(w_R, x_in) · eq(w_L, x_out)                              │
│      Extend Az/Bz prefixes to U_d^{ℓ₀} via Lagrange                    │
│      Accumulate products weighted by ein into A_i(v, u)                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Rounds 1..ℓ₀: Compute s_i(X) = ⟨R_i, A_i(·, u)⟩ for u ∈ Û_d           │
│                R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k∈U_d}                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Round ℓ₀+1: Streaming round (Algorithm 2) to bind to r_{1:ℓ₀}         │
│  Rounds ℓ₀+2..ℓ: Standard linear-time sum-check (Algorithm 1)          │
└─────────────────────────────────────────────────────────────────────────┘
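A schematic of the first ℓ₀ rounds from the middle box, with f64 standing in for field elements; the accumulator layout, function names, and the d = 2 Lagrange coefficients below are illustrative assumptions, not the repo's API.

```rust
// Schematic of rounds 1..ℓ₀ above, with f64 standing in for field elements.
// Layout: accs[i] holds A_{i+1}(v, u) as one row per prefix v ∈ U_d^i,
// each row giving the accumulator values for u ∈ Û_d.

const UD: usize = 3;     // |U_2| = |{∞, 0, 1}| for degree d = 2
const UD_HAT: usize = 2; // |Û_2| = |U_2 \ {1}|

/// Lagrange coefficients (L_{U_2,k}(r))_{k ∈ U_2}, ordered [∞, 0, 1], under
/// the convention that "evaluation at ∞" means the leading coefficient.
/// (Assumed convention for this sketch; see src/lagrange.rs for the real one.)
fn lagrange_coeffs_at(r: f64) -> [f64; UD] {
    [r * r - r, 1.0 - r, r]
}

/// Accumulator-based rounds: s_i(u) = Σ_v R_i[v] · A_i(v, u) for u ∈ Û_d,
/// then R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k ∈ U_d}.
fn first_l0_rounds(accs: &[Vec<[f64; UD_HAT]>], challenges: &[f64]) -> Vec<[f64; UD_HAT]> {
    let mut r_i: Vec<f64> = vec![1.0]; // R_1 has a single entry
    let mut round_evals = Vec::new();
    for (a_i, &r) in accs.iter().zip(challenges) {
        let mut s_i = [0.0; UD_HAT];
        for (rv, row) in r_i.iter().zip(a_i) {
            for (s, a) in s_i.iter_mut().zip(row) {
                *s += *rv * *a;
            }
        }
        round_evals.push(s_i);
        // Tensor R_i with the Lagrange coefficients of the new challenge.
        let coeffs = lagrange_coeffs_at(r);
        let mut next = Vec::with_capacity(r_i.len() * UD);
        for &rv in &r_i {
            for &c in &coeffs {
                next.push(rv * c);
            }
        }
        r_i = next;
    }
    round_evals
}

fn main() {
    // Dummy accumulators for ℓ₀ = 2: round 1 has |U_2|^0 = 1 prefix,
    // round 2 has |U_2|^1 = 3 prefixes.
    let accs = vec![
        vec![[1.0, 2.0]],
        vec![[0.5, 1.5], [2.0, 0.0], [1.0, 1.0]],
    ];
    let evals = first_l0_rounds(&accs, &[0.3, 0.7]);
    println!("per-round evaluations on Û_d: {evals:?}");
}
```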

Test Plan

  • cargo test test_build_accumulators - Verifies accumulator construction
  • cargo test test_small_value - SmallValueField arithmetic correctness
  • cargo test lagrange - Lagrange extension and interpolation
  • cargo test sumcheck - Full sum-check protocol equivalence
  • cargo clippy - No warnings
  • examples/sumcheck_sha256_equivalence.rs - Verifies new method produces identical proofs to baseline
  • examples/sha256_chain_benchmark.rs - SHA-256 chain proving with CSV output

References

  • Bagad, Dao, Domb, and Thaler. "Speeding Up Sum-Check Proving."

Commit history

Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in
src/lagrange.rs for representing evaluation domains U_d and Û_d used in
the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with
from_multilinear() factory method that extends evaluations from {0,1}^n
to U_d^n.

Introduces RoundAccumulator and SmallValueAccumulators for the
small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage
with const generic D for cache efficiency and vectorizable merge
operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and
LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:

- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()

This prevents mixing domain sizes at compile time and catches
out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps
evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by
decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
Extracts strided polynomial evaluations for all binary prefixes b ∈
{0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6
(Lagrange extension).
Added a parallel build_accumulators that binds suffixes, extends
prefixes to the Ud domain, applies the ∞/Cz rule, and routes
contributions via cached idx4 with E_in/E_out weighting. Expanded
accumulator tests with a naive cross-check, ∞ handling, and binary-β
zero behavior to validate correctness. Cleaned up dead-code allowances
now that the code paths are used.
Added explicit MSB-first checks for eq table generation,
gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure
"top" binds the MSB. These tests catch silent index/order regressions
across components.
@wu-s-john changed the title from "Implement Algorithm 6 Foundation — Procedure 9 Accumulator Builder" to "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" on Dec 18, 2025
Compute ℓ_i(X) = eq(w_{<i}, r_{<i}) · eq(w_i, X) values for the sum-check
rounds. Writing α_i = eq(w_{<i}, r_{<i}), the three evaluations the prover
needs are ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i·w_i, and ℓ_i(∞) = α_i(2w_i−1).
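A minimal sketch of those evaluations (assuming an ff 0.13-style Field with the ONE constant; eq_linear_factor_evals is a hypothetical helper name, not necessarily the repo's):

```rust
use ff::Field;

/// Returns (ℓ_i(0), ℓ_i(1), ℓ_i(∞)) given α_i = eq(w_{<i}, r_{<i}) and w_i.
/// eq(w_i, X) = (1 − w_i)(1 − X) + w_i·X, so its value at 0 is 1 − w_i, at 1
/// it is w_i, and its leading coefficient ("value at ∞") is 2·w_i − 1.
fn eq_linear_factor_evals<F: Field>(alpha_i: F, w_i: F) -> (F, F, F) {
    let at_zero = alpha_i * (F::ONE - w_i);
    let at_one = alpha_i * w_i;
    let at_inf = alpha_i * (w_i.double() - F::ONE);
    (at_zero, at_one, at_inf)
}
```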
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to
derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with
EqSumCheckInstance rounds.

Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix
data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2
allocations instead of N+1, improving cache locality. Replaces ad-hoc
offsets/entries arrays in build_accumulators
- Add prove_cubic_with_three_inputs_small_value combining small-value
  optimization for first ℓ₀ rounds with eq-poly optimization for
  remaining
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree
  parameter
- Add sumcheck_sweep.rs examples for performance comparison

The new from_boolean_evals_with_buffer_reusing method takes
caller-provided scratch buffers and alternates between them during
extension. This reduces allocations from O(num_x_in × num_x_out) per
call to O(num_threads) buffers allocated once per thread.

Spartan version (D=2) skips binary betas since satisfying witnesses have
Az·Bz = Cz on {0,1}^n. Generic version supports arbitrary polynomial
products.
Adds a new example that tests that prove_cubic_with_three_inputs and
prove_cubic_with_three_inputs_small_value produce identical proofs when
used with a real SHA-256 circuit (Algorithm 6 validation).

Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.

Key changes:
  - Add SmallValueField trait for type-safe i32/i64 small-value
    operations
  - Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
  - Add SpartanAccumulatorInput trait to unify field and i32 witness
    handling
  - Make LagrangeEvaluatedMultilinearPolynomial generic over element
    type
  - Update sumcheck prover to accept separate i32 witness polynomials
  - Clean up MultilinearPolynomial<i32>: remove unused
    from_u32/from_u64/from_field
@wu-s-john force-pushed the feat/procedure-9-accumulator branch from 2828f04 to 67674c4 on December 23, 2025

Replace raw arrays and ad-hoc structs with proper abstractions for U_d =
{∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove
EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and
  UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len,
  extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in
build_accumulators_spartan and build_accumulators. Previously, 5 vectors
were allocated on every x_out iteration; now allocations happen once per
Rayon thread subdivision.

- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids
  .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module

Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to
its own module for better code organization. Rename from
SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify
that it abstracts over multilinear polynomial representations (field
elements vs small values).
@wu-s-john changed the title from "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" to "Implement Small-Value Sum-Check Optimization (Algorithm 6)" on Dec 23, 2025
@wu-s-john marked this pull request as ready for review on December 23, 2025
Replace per-iteration modular reductions with accumulated wide-integer
arithmetic, reducing once per beta instead of once per x_in iteration.

Key changes:
- Add WideLimbs<N> for wide unsigned integer arithmetic (6/8 limbs)
- Refactor SmallValueField to be generic over small value type (i32/i64)
- Add UnreducedMontInt types for delayed reduction in Montgomery form
- Replace SpartanAccumulatorInputPolynomial with MatVecMLE trait
- Optimize eq polynomial table computation (1 mul instead of 2 per
  element)
- Update benchmark to compare i32/i64 vs i64/i128 variants
- Add mac() helper for fused multiply-accumulate, eliminating temporary
  arrays in unreduced_mont_int_mul_add (4 implementations)
- Subtract in limb space before reduction via sub_mag(), saving one
  Barrett reduction per signed accumulator
- Replace large e_out tables with JIT-computed eyx scratch buffers,
  reducing eq table memory 7× and improving cache locality
- Add unreduced_is_zero() fast path to skip expensive modular reduction
- Precompute betas_with_infty indices to avoid filter in inner loop
- Use barrett_reduce_6_* directly for i128 products instead of padding
  to 8 limbs (saves 8 wasted multiplications per isl_mul call)

Replace mac(acc, 0, 0, carry) calls with simple overflowing_add to avoid
unnecessary u128 multiply-add pipeline for pure carry propagation. Also
add #[inline(always)] to hot path functions to ensure full inlining.
- Apply rustfmt formatting fixes in accumulators.rs
- Fix clippy manual_is_multiple_of warning in test code
Introduce circuit gadgets tailored to the small-value sumcheck
optimization:

- SmallMultiEq: Batches equality constraints with bounded coefficients,
  flushing at MAX_COEFF_BITS (31) instead of bellpepper's ~237. This
  keeps constraint coefficients within i32 bounds for the small-value
  optimization.

- SmallUInt32: 32-bit unsigned integer gadget using SmallMultiEq for
  carry constraints in addmany operations.

- small_sha256: SHA-256 implementation using the above gadgets,
  producing circuits where Az and Bz values fit in i32.

- Update sumcheck_sha256_equivalence example to use bellpepper's Circuit
  trait for constraint counting, comparing SmallSha256 vs bellpepper
  SHA-256.

The tradeoff: SmallSha256 generates ~17% more R1CS constraints due to
more frequent MultiEq flushing, but enables the small-value sumcheck
optimization.
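A bookkeeping-only sketch of the flushing rule described above (hypothetical SmallMultiEqSketch type; the real gadget packs linear combinations through bellpepper's ConstraintSystem rather than just counting bits):

```rust
// Each queued equality of known bit-width is packed at the next free bit
// offset of a combined constraint; once the offset would exceed
// MAX_COEFF_BITS, the pending batch is flushed as one constraint and the
// offset resets. bellpepper's MultiEq does the same with a ~237-bit budget;
// a 31-bit budget keeps every packed coefficient within i32 range.
const MAX_COEFF_BITS: u32 = 31;

struct SmallMultiEqSketch {
    bits_used: u32,
    pending: usize,
    constraints_emitted: usize,
}

impl SmallMultiEqSketch {
    fn new() -> Self {
        Self { bits_used: 0, pending: 0, constraints_emitted: 0 }
    }

    /// Queue one equality whose values fit in `num_bits` bits.
    fn enqueue(&mut self, num_bits: u32) {
        if self.bits_used + num_bits > MAX_COEFF_BITS {
            self.flush();
        }
        self.bits_used += num_bits;
        self.pending += 1;
    }

    /// Emit the pending packed equalities as a single constraint.
    fn flush(&mut self) {
        if self.pending > 0 {
            self.constraints_emitted += 1;
            self.pending = 0;
            self.bits_used = 0;
        }
    }
}

fn main() {
    let mut eq = SmallMultiEqSketch::new();
    for _ in 0..64 {
        eq.enqueue(10); // e.g. 10-bit carry equalities
    }
    eq.flush();
    // Only three 10-bit equalities fit in a 31-bit budget, so this emits more
    // constraints than a ~237-bit budget would — the ~17% overhead noted above.
    println!("constraints emitted: {}", eq.constraints_emitted);
}
```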

Add 16-bit limbed addition for i32 small-value optimization

SmallUInt32::addmany produces coefficients up to 2^34, exceeding i32
bounds. Splitting into 16-bit limbs reduces max coefficient to 2^18,
enabling i32/i64 small-value sumcheck for SHA-256.

- Add SmallValueConfig trait with Small32 (i32/i64) and Small64
  (i64/i128)
- Implement addmany_limbed using two constraints per addition
- Update SmallMultiEq to be generic over config
- Fix example to use config-specific bounds check
- Add examples/sha256_chain_benchmark.rs comparing original vs
  small-value sumcheck performance on SHA-256 hash chains
- CSV output includes witness synthesis time, sumcheck times, speedup,
  and witness percentage of total proving time
- CLI support: single <num_vars> for profiling, range-sweep for
  benchmarks
- Add small_sha256_with_prefix() for chaining multiple SHA-256 hashes
  with unique constraint namespaces
- Fix SmallValueField<i64> generic in lagrange.rs
- Fix unused variable warning in msm.rs
Split SmallValueField into two traits for better separation of concerns:
- SmallValueField: core small-value operations (ss_mul, sl_mul, isl_mul)
- DelayedReduction: unreduced accumulator operations for hot paths

Rename types for clarity:
- UnreducedMontInt → UnreducedFieldInt (field × integer products)
- UnreducedMontMont → UnreducedFieldField (field × field products)

Add FieldReductionConstants trait to deduplicate Barrett/Montgomery
reduction:
- Consolidates Fp/Fq constants (MODULUS, R256-R512, MONT_INV)
- Generic reduction functions monomorphized at compile time for zero
  overhead
- Comprehensive documentation explaining R constants (2^k mod p)

Performance and cleanup:
- Add ext_buf_idx scratch buffer to avoid Vec allocation in accumulator
  hot loop
- Remove unused OrderedVariable from shape_cs modules (~140 lines)
- Remove unused build_univariate_round_evals from sumcheck (~40 lines)
- Add log2_constraints column to benchmark CSV output
Split the 2,367-line small_field.rs into a proper module structure:
- small_field/small_value_field.rs: SmallValueField trait
- small_field/delayed_reduction.rs: DelayedReduction trait
- small_field/barrett.rs: Barrett/Montgomery reduction functions
- small_field/impls.rs: Fp/Fq implementations and tests
- small_field/mod.rs: re-exports and helper functions

Moved batching configuration types (NoBatching, Batching<K>,
BatchingMode, SmallMultiEqConfig, I32NoBatch, I64Batch21) from
small_field to gadgets/small_multi_eq.rs where they logically belong,
since they're specifically for constraint batching in SmallMultiEq.

Added detailed documentation for I64Batch21 explaining why K=21 is the
safe maximum: with SHA-256-like circuits having ~200 terms and 2^34
positional coefficients, batching 21 constraints keeps the worst-case
magnitude (2^62) under the i64 signed limit (2^63).

Refactors shared logic between Spartan and generic accumulator builders.
@wu-s-john force-pushed the feat/procedure-9-accumulator branch from 878e7b0 to 406b59e on January 9, 2026
Improves type safety and self-documentation by replacing (bool, [u64;
N]) with an explicit enum indicating whether the result is positive (a
>= b) or negative (a < b).
Move wide_limbs.rs content and limb arithmetic from barrett.rs into a
unified small_field/limbs.rs module for delayed modular reduction.
  Split monolithic lagrange.rs (1667 lines) into focused submodules:
  - domain.rs: LagrangePoint, LagrangeHatPoint, LagrangeIndex
  - evals.rs: LagrangeEvals, LagrangeHatEvals
  - basis.rs: LagrangeBasisFactory, LagrangeCoeff
  - extension.rs: LagrangeEvaluatedMultilinearPolynomial
  - accumulator.rs: RoundAccumulator, LagrangeAccumulators
  - accumulator_builder.rs: build_accumulators_spartan,
    build_accumulators

  Consolidate related files into the module:
  - accumulator_index.rs → index.rs
  - thread_state_accumulators.rs → thread_state.rs
  - eq_linear.rs → eq_round.rs

  Simplify extend_in_place API: use std::mem::swap to ensure result is
  always in first buffer, eliminating conditional buffer selection at
  call sites. Rename buf_a/b to buf_curr/scratch for clarity.
  - Refactor SmallMultiEq from struct to trait with NoBatchEq and
    BatchingEq<K> implementations
  - Add addmany module with limbed (i32) and full (i64) addition
    algorithms
  - Deduplicate SHA-256 circuits into examples/circuits/sha256/ module
  - Update small_uint32 and small_sha256 to use SmallMultiEq trait

- Extend MatVecMLE trait with UnreducedFieldField type for F×F
  accumulation
- Add unreduced bucket accumulators to SpartanThreadState
- Replace eyx precomputation with direct e_y access and z_beta = ex *
  tA_red
- Keep unreduced across all x_out iterations and merge without reduction
- Pre-compute beta values to eliminate closure overhead in scatter loop
- Final Montgomery reduction only once per bucket after thread merge

This reduces Montgomery reductions from ~7000+ per x_out to ~26 total
for typical parameters (l0=3, 128 x_outs).

Replace asymmetric l/2 split with balanced ceil/floor split. This
reduces precomputation cost (e.g., 36→24 for l=10, l0=3), enables odd
number of rounds, and improves cache utilization by making e_xout
smaller.

Wraps each benchmark in its own scope block so large polynomial vectors
are dropped before the next benchmark starts. Reduces peak memory from
~78GB to ~26GB for num_vars=28.

Also removes the unnecessary even num_vars constraint from Algorithm 6.
 Implement Spartan Engine trait for BN254 (alt_bn128) including:
 - Barrett reduction constants and field operations for BN254 Fr
 - SmallValueField<i64> and DelayedReduction<i64> implementations
 - Bn254Engine with Hyrax PCS and Keccak256 transcript
- Add --field flag to select Pallas, Vesta, or BN254 curves
- Add --methods flag to choose benchmarks (base, i32, i64)
- Add --trials flag for multiple runs per num_vars
- Separate setup and prove timings in CSV output
- Make benchmarks generic over Engine type
Move beta_values Vec from per-iteration allocation to thread-local
buffer in SpartanThreadState. Reduces allocations from O(num_x_out) to
O(num_threads) in the scatter phase.

- Use Vec::with_capacity(num_rounds) for r and polys vectors
- Replace heap-allocated vec![...] with stack array for per-round evals
Add bind_three_polys_top helper that binds three polynomials together,
reducing Rayon dispatches from 2 to 0-1 per round and using serial
fallback for small polynomials (n < 4096) to avoid scheduling overhead.
Uses two-phase wide-limb accumulation to reduce Montgomery reductions
from O(2^k) to O(2^{k/2}) per round in
prove_cubic_with_three_inputs_small_value. Exploits split-eq
factorization E[id] = E_out[x_out] * E_in[x_in].
Extract extend_single and extend_batch4 helpers to process 4 suffix
elements together, enabling instruction-level parallelism on AArch64.
Adds prove_cubic_with_three_inputs_split_eq_delayed to measure the
effect of delayed modular reduction in eq polynomial evaluation
separately from small-value accumulator precomputation.
Introduces DelayedModularReductionMode trait with
DelayedModularReductionEnabled and DelayedModularReductionDisabled
marker types for zero-cost compile-time strategy selection. This enables
benchmarking DMR speedup without runtime branching overhead.

  Key changes:
  - Add delay_modular_reduction_mode.rs with AccumulateProduct and Mode
    traits
  - Make RoundAccumulator/LagrangeAccumulators generic over element type
  - Simplify MatVecMLE: move accumulation logic to Mode trait
  - Add accum_bench example for DMR comparison benchmarks
  - Make lagrange_sumcheck module public for external benchmarking
@wu-s-john (author) commented: @microsoft-github-policy-service agree
