@wu-s-john commented on Dec 18, 2025

Implement Small-Value Sum-Check Optimization (Algorithm 6)

Summary

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.

Key Insight

In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.

Multiplication Cost Hierarchy:

  • ss (small × small): Native i32/i64 multiplication (~1 cycle)
  • sl (small × large): Barrett-optimized multiplication (~9 base mults)
  • ll (large × large): Full Montgomery multiplication (~32 base mults)

For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
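To make the trade-off concrete, here is a small illustrative sketch (not code from this PR) that plugs ℓ₀ into the operation counts above; the constants are the rough per-multiplication costs quoted in the hierarchy, not measured numbers.

```rust
// Illustrative only: asymptotic multiplication counts for Algorithm 6 at a
// given ℓ₀, using the rough costs quoted above (ll ≈ 32 base mults, ss ≈ 1).
fn algorithm6_mult_counts(n: u64, l0: u32) -> (u64, u64) {
    let ll_mults = n >> l0;                                      // O(N / 2^ℓ₀)
    let ss_mults = (n as f64 * 1.5_f64.powi(l0 as i32)) as u64;  // O((3/2)^ℓ₀ · N)
    (ll_mults, ss_mults)
}

fn main() {
    let n = 1u64 << 26; // 2^26 constraints, as in the headline benchmark
    for l0 in 1..=5 {
        let (ll, ss) = algorithm6_mult_counts(n, l0);
        // Larger ℓ₀ trades expensive field multiplications for cheap native
        // ones, until the (3/2)^ℓ₀ growth in small multiplications dominates.
        println!("l0 = {l0}: ~{ll} ll mults, ~{ss} ss mults");
    }
}
```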

Benchmarks

Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

Headline Result: 1.83× Speedup on BN254

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):

sumcheck_bench_bn254-fr_n26_l03
| Percentile   | Base prove (ms) | Small-value prove (ms) | Speedup |
|--------------|-----------------|------------------------|---------|
| 25%          | 2,588           | 1,418                  | 1.79×   |
| 50% (median) | 2,616           | 1,429                  | 1.82×   |
| 75%          | 2,662           | 1,473                  | 1.84×   |
| 90%          | 2,895           | 1,577                  | 1.88×   |
  • Mean speedup: 1.83× (base: 2,671ms → small-value: 1,464ms)
  • Lower variance in optimized version (std: 73ms vs 151ms)
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

`cargo run --release --example sumcheck_sweep -- --field bn254-fr range-sweep --min 16 --max 27`

| num_vars | n           | base prove (µs) | small-value prove (µs) | speedup |
|----------|-------------|-----------------|------------------------|---------|
| 16       | 65,536      | 9,459           | 4,619                  | 2.05×   |
| 17       | 131,072     | 8,795           | 4,516                  | 1.95×   |
| 18       | 262,144     | 13,582          | 7,567                  | 1.80×   |
| 19       | 524,288     | 24,293          | 14,199                 | 1.71×   |
| 20       | 1,048,576   | 49,350          | 25,117                 | 1.97×   |
| 21       | 2,097,152   | 96,088          | 62,382                 | 1.54×   |
| 22       | 4,194,304   | 176,852         | 94,437                 | 1.87×   |
| 23       | 8,388,608   | 349,647         | 190,513                | 1.84×   |
| 24       | 16,777,216  | 679,342         | 365,720                | 1.86×   |
| 25       | 33,554,432  | 1,409,206       | 729,882                | 1.93×   |
| 26       | 67,108,864  | 2,799,654       | 1,493,126              | 1.88×   |
| 27       | 134,217,728 | 5,671,207       | 3,066,167              | 1.85×   |

Key observations:

  • Speedups range from 1.54× to 2.05× on BN254, with most sizes landing in the 1.8-1.9× range
  • Speedup remains stable even at n = 2²⁷ (134M constraints)
  • Peak speedup of 2.05× at n = 2¹⁶

Delayed Modular Reduction Impact

To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.

Accumulator Building Phase

The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.

`cargo run --release --example accum_bench -- --field bn254-fr --l0 3 range-sweep --min 16 --max 27`

| num_vars | n           | DMR accum (µs) | no-DMR accum (µs) | accum speedup |
|----------|-------------|----------------|-------------------|---------------|
| 16       | 65,536      | 1,058          | 3,983             | 3.76×         |
| 17       | 131,072     | 1,595          | 8,668             | 5.43×         |
| 18       | 262,144     | 2,434          | 18,357            | 7.54×         |
| 19       | 524,288     | 3,831          | 15,732            | 4.11×         |
| 20       | 1,048,576   | 7,708          | 29,625            | 3.84×         |
| 21       | 2,097,152   | 13,739         | 55,788            | 4.06×         |
| 22       | 4,194,304   | 25,401         | 120,680           | 4.75×         |
| 23       | 8,388,608   | 55,910         | 223,901           | 4.00×         |
| 24       | 16,777,216  | 116,574        | 458,829           | 3.94×         |
| 25       | 33,554,432  | 242,699        | 935,360           | 3.85×         |
| 26       | 67,108,864  | 516,107        | 1,891,151         | 3.66×         |
| 27       | 134,217,728 | 1,024,774      | 3,794,770         | 3.70×         |

Key observations:

  • DMR provides 3.7-7.5× speedup on accumulator building with ℓ₀ = 3
  • Peak speedup of 7.54× at n = 2¹⁸
  • This is the primary source of performance gains in the small-value optimization
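The idea behind DMR, stripped of the repo's WideLimbs/Barrett machinery, is to accumulate unreduced products and reduce once per bucket. A minimal toy sketch (61-bit Mersenne prime, u128 accumulator; not the actual implementation):

```rust
// Toy illustration of delayed modular reduction, assuming a 61-bit Mersenne
// prime so the u128 accumulator below cannot overflow. The real code uses
// multi-limb (WideLimbs) accumulators with Barrett/Montgomery reduction; this
// only shows the "reduce once per bucket" idea.
const P: u128 = (1u128 << 61) - 1;

fn eager(values: &[(u64, u64)]) -> u64 {
    let mut acc: u128 = 0;
    for &(a, b) in values {
        acc = (acc + (a as u128) * (b as u128)) % P; // one reduction per term
    }
    acc as u64
}

fn delayed(values: &[(u64, u64)]) -> u64 {
    let mut acc: u128 = 0;
    for &(a, b) in values {
        acc += (a as u128) * (b as u128); // accumulate unreduced
    }
    (acc % P) as u64 // single reduction at the end of the bucket
}

fn main() {
    let values: Vec<(u64, u64)> = (0..1_000u64).map(|i| (i + 1, 2 * i + 3)).collect();
    assert_eq!(eager(&values), delayed(&values));
    println!("same result, ~1000× fewer reductions in the delayed version");
}
```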

Time Breakdown: First ℓ₀ Rounds vs Remaining Rounds

With ℓ₀ = 3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:

| num_vars | n           | first ℓ₀ (ms) | remaining (ms) | ratio (first : remaining) |
|----------|-------------|---------------|----------------|---------------------------|
| 16       | 65,536      | 1.2           | 3.4            | 0.34 : 1                  |
| 17       | 131,072     | 1.7           | 2.8            | 0.60 : 1                  |
| 18       | 262,144     | 2.5           | 5.0            | 0.50 : 1                  |
| 19       | 524,288     | 3.9           | 10.3           | 0.38 : 1                  |
| 20       | 1,048,576   | 7.8           | 17.3           | 0.45 : 1                  |
| 21       | 2,097,152   | 13.8          | 48.6           | 0.28 : 1                  |
| 22       | 4,194,304   | 25.5          | 68.9           | 0.37 : 1                  |
| 23       | 8,388,608   | 56.0          | 134.5          | 0.42 : 1                  |
| 24       | 16,777,216  | 116.6         | 249.1          | 0.47 : 1                  |
| 25       | 33,554,432  | 242.8         | 487.1          | 0.50 : 1                  |
| 26       | 67,108,864  | 516.2         | 977.0          | 0.53 : 1                  |
| 27       | 134,217,728 | 1,024.8       | 2,041.3        | 0.50 : 1                  |

Key observations:

  • First ℓ₀ rounds (accumulator build + the ℓ₀ round proofs) take roughly 1/3 to 1/2 of total prove time
  • The remaining rounds (ℓ₀+1 through ℓ) dominate, taking roughly 2× as long as the first ℓ₀
  • The ratio stabilizes around 1 : 2 for large instances (n ≥ 2²⁴)
  • This balanced split indicates ℓ₀ = 3 is a good choice for BN254

Split-Eq Sumcheck with DMR

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.

`cargo run --release --example sumcheck_sweep -- --methods base,split-eq-dmr range-sweep --min 16 --max 27`

| num_vars | n           | base prove (µs) | split-eq-DMR prove (µs) | prove speedup |
|----------|-------------|-----------------|-------------------------|---------------|
| 16       | 65,536      | 9,546           | 5,113                   | 1.87×         |
| 17       | 131,072     | 8,278           | 5,698                   | 1.45×         |
| 18       | 262,144     | 13,570          | 9,973                   | 1.36×         |
| 19       | 524,288     | 26,253          | 18,446                  | 1.42×         |
| 20       | 1,048,576   | 44,574          | 41,793                  | 1.07×         |
| 21       | 2,097,152   | 95,211          | 86,503                  | 1.10×         |
| 22       | 4,194,304   | 172,328         | 136,639                 | 1.26×         |
| 23       | 8,388,608   | 327,351         | 271,947                 | 1.20×         |
| 24       | 16,777,216  | 630,354         | 525,370                 | 1.20×         |
| 25       | 33,554,432  | 1,255,856       | 1,058,761               | 1.19×         |
| 26       | 67,108,864  | 2,506,240       | 2,171,037               | 1.15×         |
| 27       | 134,217,728 | 4,904,540       | 5,304,525               | 0.93×         |

Key observations:

  • Split-eq with DMR yields a 1.07-1.87× speedup at every size up to n = 2²⁶
  • At n = 2²⁷, there is a slight slowdown (0.93×), likely due to increased memory pressure from DMR state
  • The speedup peaks at 1.87× for the smallest instance (n = 2¹⁶)
  • For large instances (n = 2²⁵-2²⁶), the speedup settles around 1.15-1.19×

SHA-256 Chain Benchmark

To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.

`cargo run --release --no-default-features --example sha256_chain_benchmark`

| chain_length | num_vars | log₂(constraints) | num_constraints | witness (ms) | orig sumcheck (ms) | small sumcheck (ms) | total (ms) | speedup | witness % |
|--------------|----------|-------------------|-----------------|--------------|--------------------|---------------------|------------|---------|-----------|
| 2            | 16       | 16                | 65,536          | 14           | 5                  | 3                   | 20         | 1.67×   | 70.0%     |
| 8            | 18       | 18                | 262,144         | 55           | 16                 | 11                  | 75         | 1.45×   | 73.3%     |
| 32           | 20       | 20                | 1,048,576       | 229          | 48                 | 32                  | 301        | 1.50×   | 76.1%     |
| 128          | 22       | 22                | 4,194,304       | 1,260        | 163                | 109                 | 1,547      | 1.50×   | 81.4%     |
| 512          | 24       | 24                | 16,777,216      | 5,686        | 609                | 395                 | 6,743      | 1.54×   | 84.3%     |
| 2048         | 26       | 26                | 67,108,864      | 17,015       | 2,857              | 1,677               | 22,116     | 1.70×   | 76.9%     |

Key observations:

  • 2048 SHA-256 hashes proven in ~22 seconds
  • Witness generation dominates at 70-84% of total proving time
  • Small-value sumcheck achieves consistent 1.45-1.70× speedup

Solana Light Client Comparison

A Solana light client verifying block finality requires:

| Component                   | Hash function              | Count          |
|-----------------------------|----------------------------|----------------|
| Vote signature verification | SHA-512 (Ed25519 internal) | ~21 to ~1,588  |
| Merkle shred verification   | SHA-256                    | ~108 to ~1,206 |

  • Ed25519 uses SHA-512 internally for challenge hashing
  • Finality requires ≥2/3 supermajority stake (~21-530 validators)
  • SHA-512 is ~1.5-2× more expensive than SHA-256 per hash

SHA-256 equivalent cost:

  • Solana SHA-256: ~1,206 hashes
  • Solana SHA-512: ~1,588 × 1.5-2 = ~2,382-3,176 SHA-256 equivalent
  • Total: ~3,588-4,382 SHA-256 equivalent
  • Our 2048-chain benchmark covers ~47-57% of Solana's worst-case proving requirement

Implementation

Core Components

  1. SmallValueField trait (src/small_field.rs) — see the sketch after this list

    • Defines SmallValue (i32) and IntermediateSmallValue (i64) types
    • Barrett-optimized sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)
    • Overflow analysis ensuring correctness for typical witness bounds
  2. Lagrange Domain Extension (src/lagrange.rs)

    • LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
    • Zero-allocation extend_in_place with ping-pong buffers
    • gather_prefix_evals for efficient prefix collection (Procedure 6)
  3. Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)

    • SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
    • idx4 mapping (Definition A.5) for distributing products to correct accumulators
    • Type-safe UdEvaluations and UdHatEvaluations wrappers
  4. Procedure 9 Implementation (src/accumulators.rs)

    • build_accumulators_spartan: Optimized for Spartan's Az·Bz structure
    • build_accumulators: Generic version for arbitrary polynomial products
    • Parallel fold-reduce with thread-local scratch buffers
  5. Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)

    • SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations
    • Reduces allocator contention in parallel workloads
  6. Sum-Check Integration (src/sumcheck.rs)

    • SmallValueSumCheck::from_accumulators factory method
    • Round-by-round Lagrange coefficient multiplication (R_{i+1} = R_i ⊗ L_{U_d}(r_i))
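Following up on item 1 above, here is a rough sketch of the shape of the SmallValueField trait; the method names ss_mul, sl_mul, and isl_mul come from the PR description, but the exact signatures and bounds in src/small_field.rs may differ (the real code also splits out a separate DelayedReduction trait).

```rust
// Sketch only — not the repo's exact trait definition.
use ff::PrimeField;

/// Field with cheap mixed small/large multiplications.
/// `SmallValue` holds original witness evaluations (fits in i32);
/// `IntermediateSmallValue` holds products of small values (fits in i64).
pub trait SmallValueFieldSketch: PrimeField {
    type SmallValue: Copy;               // e.g. i32
    type IntermediateSmallValue: Copy;   // e.g. i64

    /// small × small → intermediate: native integer multiply (~1 cycle).
    fn ss_mul(a: Self::SmallValue, b: Self::SmallValue) -> Self::IntermediateSmallValue;

    /// small × large → field: Barrett-style reduction (~9 base mults).
    fn sl_mul(small: Self::SmallValue, large: &Self) -> Self;

    /// intermediate-small × large → field: same idea for i64 inputs.
    fn isl_mul(small: Self::IntermediateSmallValue, large: &Self) -> Self;
}
```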

Algorithm Flow

┌─────────────────────────────────────────────────────────────────────────┐
│  Precomputation: Build accumulators A_i(v, u) for i ∈ [ℓ₀]              │
│                                                                         │
│  For each x_out ∈ {0,1}^{ℓ/2-ℓ₀}:                                       │
│    For each x_in ∈ {0,1}^{ℓ/2}:                                         │
│      ein = eq(w_R, x_in) · eq(w_L, x_out)                              │
│      Extend Az/Bz prefixes to U_d^{ℓ₀} via Lagrange                    │
│      Accumulate products weighted by ein into A_i(v, u)                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Rounds 1..ℓ₀: Compute s_i(X) = ⟨R_i, A_i(·, u)⟩ for u ∈ Û_d           │
│                R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k∈U_d}                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Round ℓ₀+1: Streaming round (Algorithm 2) to bind to r_{1:ℓ₀}         │
│  Rounds ℓ₀+2..ℓ: Standard linear-time sum-check (Algorithm 1)          │
└─────────────────────────────────────────────────────────────────────────┘
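A schematic of the first ℓ₀ rounds from the middle box, with f64 standing in for field elements; the accumulator layout, function names, and the d = 2 Lagrange coefficients below are illustrative assumptions, not the repo's API.

```rust
// Schematic of rounds 1..ℓ₀ above, with f64 standing in for field elements.
// Layout: accs[i] holds A_{i+1}(v, u) as one row per prefix v ∈ U_d^i,
// each row giving the accumulator values for u ∈ Û_d.

const UD: usize = 3;     // |U_2| = |{∞, 0, 1}| for degree d = 2
const UD_HAT: usize = 2; // |Û_2| = |U_2 \ {1}|

/// Lagrange coefficients (L_{U_2,k}(r))_{k ∈ U_2}, ordered [∞, 0, 1], under
/// the convention that "evaluation at ∞" means the leading coefficient.
/// (Assumed convention for this sketch; see src/lagrange.rs for the real one.)
fn lagrange_coeffs_at(r: f64) -> [f64; UD] {
    [r * r - r, 1.0 - r, r]
}

/// Accumulator-based rounds: s_i(u) = Σ_v R_i[v] · A_i(v, u) for u ∈ Û_d,
/// then R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k ∈ U_d}.
fn first_l0_rounds(accs: &[Vec<[f64; UD_HAT]>], challenges: &[f64]) -> Vec<[f64; UD_HAT]> {
    let mut r_i: Vec<f64> = vec![1.0]; // R_1 has a single entry
    let mut round_evals = Vec::new();
    for (a_i, &r) in accs.iter().zip(challenges) {
        let mut s_i = [0.0; UD_HAT];
        for (rv, row) in r_i.iter().zip(a_i) {
            for (s, a) in s_i.iter_mut().zip(row) {
                *s += *rv * *a;
            }
        }
        round_evals.push(s_i);
        // Tensor R_i with the Lagrange coefficients of the new challenge.
        let coeffs = lagrange_coeffs_at(r);
        let mut next = Vec::with_capacity(r_i.len() * UD);
        for &rv in &r_i {
            for &c in &coeffs {
                next.push(rv * c);
            }
        }
        r_i = next;
    }
    round_evals
}

fn main() {
    // Dummy accumulators for ℓ₀ = 2: round 1 has |U_2|^0 = 1 prefix,
    // round 2 has |U_2|^1 = 3 prefixes.
    let accs = vec![
        vec![[1.0, 2.0]],
        vec![[0.5, 1.5], [2.0, 0.0], [1.0, 1.0]],
    ];
    let evals = first_l0_rounds(&accs, &[0.3, 0.7]);
    println!("per-round evaluations on Û_d: {evals:?}");
}
```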

Test Plan

  • cargo test test_build_accumulators - Verifies accumulator construction
  • cargo test test_small_value - SmallValueField arithmetic correctness
  • cargo test lagrange - Lagrange extension and interpolation
  • cargo test sumcheck - Full sum-check protocol equivalence
  • cargo clippy - No warnings
  • examples/sumcheck_sha256_equivalence.rs - Verifies new method produces identical proofs to baseline
  • examples/sha256_chain_benchmark.rs - SHA-256 chain proving with CSV output

References

  • Bagad, Dao, Domb, and Thaler. "Speeding Up Sum-Check Proving."

Commit history

Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in
src/lagrange.rs for representing evaluation domains U_d and Û_d used in
the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with
from_multilinear() factory method that extends evaluations from {0,1}^n
to U_d^n.

Introduces RoundAccumulator and SmallValueAccumulators for the
small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage
with const generic D for cache efficiency and vectorizable merge
operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and
LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:

- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()

This prevents mixing domain sizes at compile time and catches
out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps
evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by
decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
Extracts strided polynomial evaluations for all binary prefixes b ∈
{0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6
(Lagrange extension).
Added a parallel build_accumulators that binds suffixes, extends
prefixes to the Ud domain, applies the ∞/Cz rule, and routes
contributions via cached idx4 with E_in/E_out weighting. Expanded
accumulator tests with a naive cross-check, ∞ handling, and binary-β
zero behavior to validate correctness. Cleaned up dead-code allowances
now that the code paths are used.
Added explicit MSB-first checks for eq table generation,
gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure
"top" binds the MSB. These tests catch silent index/order regressions
across components.
@wu-s-john changed the title from "Implement Algorithm 6 Foundation — Procedure 9 Accumulator Builder" to "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" on Dec 18, 2025
Compute ℓ_i(X) = eq(w_{<i}, r_{<i}) · eq(w_i, X) values for the sum-check
rounds. Writing α_i = eq(w_{<i}, r_{<i}), the three evaluations the prover
needs are ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i·w_i, and ℓ_i(∞) = α_i(2w_i−1).
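A minimal sketch of those evaluations (assuming an ff 0.13-style Field with the ONE constant; eq_linear_factor_evals is a hypothetical helper name, not necessarily the repo's):

```rust
use ff::Field;

/// Returns (ℓ_i(0), ℓ_i(1), ℓ_i(∞)) given α_i = eq(w_{<i}, r_{<i}) and w_i.
/// eq(w_i, X) = (1 − w_i)(1 − X) + w_i·X, so its value at 0 is 1 − w_i, at 1
/// it is w_i, and its leading coefficient ("value at ∞") is 2·w_i − 1.
fn eq_linear_factor_evals<F: Field>(alpha_i: F, w_i: F) -> (F, F, F) {
    let at_zero = alpha_i * (F::ONE - w_i);
    let at_one = alpha_i * w_i;
    let at_inf = alpha_i * (w_i.double() - F::ONE);
    (at_zero, at_one, at_inf)
}
```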
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to
derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with
EqSumCheckInstance rounds.

Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix
data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2
allocations instead of N+1, improving cache locality. Replaces ad-hoc
offsets/entries arrays in build_accumulators
- Add prove_cubic_with_three_inputs_small_value combining small-value
  optimization for first ℓ₀ rounds with eq-poly optimization for
  remaining
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree
  parameter
- Add sumcheck_sweep.rs examples for performance comparison

The new from_boolean_evals_with_buffer_reusing method takes
caller-provided scratch buffers and alternates between them during
extension. This reduces allocations from O(num_x_in × num_x_out) per
call to O(num_threads) buffers allocated once per thread.

Spartan version (D=2) skips binary betas since satisfying witnesses have
Az·Bz = Cz on {0,1}^n. Generic version supports arbitrary polynomial
products.
Adds a new example that tests that prove_cubic_with_three_inputs and
prove_cubic_with_three_inputs_small_value produce identical proofs when
used with a real SHA-256 circuit (Algorithm 6 validation).

Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.

Key changes:
  - Add SmallValueField trait for type-safe i32/i64 small-value
    operations
  - Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
  - Add SpartanAccumulatorInput trait to unify field and i32 witness
    handling
  - Make LagrangeEvaluatedMultilinearPolynomial generic over element
    type
  - Update sumcheck prover to accept separate i32 witness polynomials
  - Clean up MultilinearPolynomial<i32>: remove unused
    from_u32/from_u64/from_field
@wu-s-john force-pushed the feat/procedure-9-accumulator branch from 2828f04 to 67674c4 on December 23, 2025

Replace raw arrays and ad-hoc structs with proper abstractions for U_d =
{∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove
EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and
  UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len,
  extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in
build_accumulators_spartan and build_accumulators. Previously, 5 vectors
were allocated on every x_out iteration; now allocations happen once per
Rayon thread subdivision.

- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids
  .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module

Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to
its own module for better code organization. Rename from
SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify
that it abstracts over multilinear polynomial representations (field
elements vs small values).
@wu-s-john changed the title from "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" to "Implement Small-Value Sum-Check Optimization (Algorithm 6)" on Dec 23, 2025
@wu-s-john marked this pull request as ready for review on December 23, 2025
Replace per-iteration modular reductions with accumulated wide-integer
arithmetic, reducing once per beta instead of once per x_in iteration.

Key changes:
- Add WideLimbs<N> for wide unsigned integer arithmetic (6/8 limbs)
- Refactor SmallValueField to be generic over small value type (i32/i64)
- Add UnreducedMontInt types for delayed reduction in Montgomery form
- Replace SpartanAccumulatorInputPolynomial with MatVecMLE trait
- Optimize eq polynomial table computation (1 mul instead of 2 per
  element)
- Update benchmark to compare i32/i64 vs i64/i128 variants
- Add mac() helper for fused multiply-accumulate, eliminating temporary
  arrays in unreduced_mont_int_mul_add (4 implementations)
- Subtract in limb space before reduction via sub_mag(), saving one
  Barrett reduction per signed accumulator
- Replace large e_out tables with JIT-computed eyx scratch buffers,
  reducing eq table memory 7× and improving cache locality
- Add unreduced_is_zero() fast path to skip expensive modular reduction
- Precompute betas_with_infty indices to avoid filter in inner loop
- Use barrett_reduce_6_* directly for i128 products instead of padding
  to 8 limbs (saves 8 wasted multiplications per isl_mul call)

Replace mac(acc, 0, 0, carry) calls with simple overflowing_add to avoid
unnecessary u128 multiply-add pipeline for pure carry propagation. Also
add #[inline(always)] to hot path functions to ensure full inlining.
- Apply rustfmt formatting fixes in accumulators.rs
- Fix clippy manual_is_multiple_of warning in test code
Introduce circuit gadgets tailored to the small-value sumcheck
optimization:

- SmallMultiEq: Batches equality constraints with bounded coefficients,
  flushing at MAX_COEFF_BITS (31) instead of bellpepper's ~237. This
  keeps constraint coefficients within i32 bounds for the small-value
  optimization.

- SmallUInt32: 32-bit unsigned integer gadget using SmallMultiEq for
  carry constraints in addmany operations.

- small_sha256: SHA-256 implementation using the above gadgets,
  producing circuits where Az and Bz values fit in i32.

- Update sumcheck_sha256_equivalence example to use bellpepper's Circuit
  trait for constraint counting, comparing SmallSha256 vs bellpepper
  SHA-256.

The tradeoff: SmallSha256 generates ~17% more R1CS constraints due to
more frequent MultiEq flushing, but enables the small-value sumcheck
optimization.
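A bookkeeping-only sketch of the flushing rule described above (hypothetical SmallMultiEqSketch type; the real gadget packs linear combinations through bellpepper's ConstraintSystem rather than just counting bits):

```rust
// Each queued equality of known bit-width is packed at the next free bit
// offset of a combined constraint; once the offset would exceed
// MAX_COEFF_BITS, the pending batch is flushed as one constraint and the
// offset resets. bellpepper's MultiEq does the same with a ~237-bit budget;
// a 31-bit budget keeps every packed coefficient within i32 range.
const MAX_COEFF_BITS: u32 = 31;

struct SmallMultiEqSketch {
    bits_used: u32,
    pending: usize,
    constraints_emitted: usize,
}

impl SmallMultiEqSketch {
    fn new() -> Self {
        Self { bits_used: 0, pending: 0, constraints_emitted: 0 }
    }

    /// Queue one equality whose values fit in `num_bits` bits.
    fn enqueue(&mut self, num_bits: u32) {
        if self.bits_used + num_bits > MAX_COEFF_BITS {
            self.flush();
        }
        self.bits_used += num_bits;
        self.pending += 1;
    }

    /// Emit the pending packed equalities as a single constraint.
    fn flush(&mut self) {
        if self.pending > 0 {
            self.constraints_emitted += 1;
            self.pending = 0;
            self.bits_used = 0;
        }
    }
}

fn main() {
    let mut eq = SmallMultiEqSketch::new();
    for _ in 0..64 {
        eq.enqueue(10); // e.g. 10-bit carry equalities
    }
    eq.flush();
    // Only three 10-bit equalities fit in a 31-bit budget, so this emits more
    // constraints than a ~237-bit budget would — the ~17% overhead noted above.
    println!("constraints emitted: {}", eq.constraints_emitted);
}
```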

Add 16-bit limbed addition for i32 small-value optimization

SmallUInt32::addmany produces coefficients up to 2^34, exceeding i32
bounds. Splitting into 16-bit limbs reduces max coefficient to 2^18,
enabling i32/i64 small-value sumcheck for SHA-256.

- Add SmallValueConfig trait with Small32 (i32/i64) and Small64
  (i64/i128)
- Implement addmany_limbed using two constraints per addition
- Update SmallMultiEq to be generic over config
- Fix example to use config-specific bounds check
- Add examples/sha256_chain_benchmark.rs comparing original vs
  small-value sumcheck performance on SHA-256 hash chains
- CSV output includes witness synthesis time, sumcheck times, speedup,
  and witness percentage of total proving time
- CLI support: single <num_vars> for profiling, range-sweep for
  benchmarks
- Add small_sha256_with_prefix() for chaining multiple SHA-256 hashes
  with unique constraint namespaces
- Fix SmallValueField<i64> generic in lagrange.rs
- Fix unused variable warning in msm.rs
Split SmallValueField into two traits for better separation of concerns:
- SmallValueField: core small-value operations (ss_mul, sl_mul, isl_mul)
- DelayedReduction: unreduced accumulator operations for hot paths

Rename types for clarity:
- UnreducedMontInt → UnreducedFieldInt (field × integer products)
- UnreducedMontMont → UnreducedFieldField (field × field products)

Add FieldReductionConstants trait to deduplicate Barrett/Montgomery
reduction:
- Consolidates Fp/Fq constants (MODULUS, R256-R512, MONT_INV)
- Generic reduction functions monomorphized at compile time for zero
  overhead
- Comprehensive documentation explaining R constants (2^k mod p)

Performance and cleanup:
- Add ext_buf_idx scratch buffer to avoid Vec allocation in accumulator
  hot loop
- Remove unused OrderedVariable from shape_cs modules (~140 lines)
- Remove unused build_univariate_round_evals from sumcheck (~40 lines)
- Add log2_constraints column to benchmark CSV output
Split the 2,367-line small_field.rs into a proper module structure:
- small_field/small_value_field.rs: SmallValueField trait
- small_field/delayed_reduction.rs: DelayedReduction trait
- small_field/barrett.rs: Barrett/Montgomery reduction functions
- small_field/impls.rs: Fp/Fq implementations and tests
- small_field/mod.rs: re-exports and helper functions

Moved batching configuration types (NoBatching, Batching<K>,
BatchingMode, SmallMultiEqConfig, I32NoBatch, I64Batch21) from
small_field to gadgets/small_multi_eq.rs where they logically belong,
since they're specifically for constraint batching in SmallMultiEq.

Added detailed documentation for I64Batch21 explaining why K=21 is the
safe maximum: with SHA-256-like circuits having ~200 terms and 2^34
positional coefficients, batching 21 constraints keeps the worst-case
magnitude (2^62) under the i64 signed limit (2^63).

Refactors shared logic between Spartan and generic accumulator builders.
@wu-s-john force-pushed the feat/procedure-9-accumulator branch from 878e7b0 to 406b59e on January 9, 2026
Improves type safety and self-documentation by replacing (bool, [u64;
N]) with an explicit enum indicating whether the result is positive (a
>= b) or negative (a < b).
Move wide_limbs.rs content and limb arithmetic from barrett.rs into a
unified small_field/limbs.rs module for delayed modular reduction.
  Split monolithic lagrange.rs (1667 lines) into focused submodules:
  - domain.rs: LagrangePoint, LagrangeHatPoint, LagrangeIndex
  - evals.rs: LagrangeEvals, LagrangeHatEvals
  - basis.rs: LagrangeBasisFactory, LagrangeCoeff
  - extension.rs: LagrangeEvaluatedMultilinearPolynomial
  - accumulator.rs: RoundAccumulator, LagrangeAccumulators
  - accumulator_builder.rs: build_accumulators_spartan,
    build_accumulators

  Consolidate related files into the module:
  - accumulator_index.rs → index.rs
  - thread_state_accumulators.rs → thread_state.rs
  - eq_linear.rs → eq_round.rs

  Simplify extend_in_place API: use std::mem::swap to ensure result is
  always in first buffer, eliminating conditional buffer selection at
  call sites. Rename buf_a/b to buf_curr/scratch for clarity.
  - Refactor SmallMultiEq from struct to trait with NoBatchEq and
    BatchingEq<K> implementations
  - Add addmany module with limbed (i32) and full (i64) addition
    algorithms
  - Deduplicate SHA-256 circuits into examples/circuits/sha256/ module
  - Update small_uint32 and small_sha256 to use SmallMultiEq trait

- Extend MatVecMLE trait with UnreducedFieldField type for F×F
  accumulation
- Add unreduced bucket accumulators to SpartanThreadState
- Replace eyx precomputation with direct e_y access and z_beta = ex *
  tA_red
- Keep unreduced across all x_out iterations and merge without reduction
- Pre-compute beta values to eliminate closure overhead in scatter loop
- Final Montgomery reduction only once per bucket after thread merge

This reduces Montgomery reductions from ~7000+ per x_out to ~26 total
for typical parameters (l0=3, 128 x_outs).

Replace asymmetric l/2 split with balanced ceil/floor split. This
reduces precomputation cost (e.g., 36→24 for l=10, l0=3), enables odd
number of rounds, and improves cache utilization by making e_xout
smaller.

Wraps each benchmark in its own scope block so large polynomial vectors
are dropped before the next benchmark starts. Reduces peak memory from
~78GB to ~26GB for num_vars=28.

Also removes the unnecessary even num_vars constraint from Algorithm 6.
 Implement Spartan Engine trait for BN254 (alt_bn128) including:
 - Barrett reduction constants and field operations for BN254 Fr
 - SmallValueField<i64> and DelayedReduction<i64> implementations
 - Bn254Engine with Hyrax PCS and Keccak256 transcript
- Add --field flag to select Pallas, Vesta, or BN254 curves
- Add --methods flag to choose benchmarks (base, i32, i64)
- Add --trials flag for multiple runs per num_vars
- Separate setup and prove timings in CSV output
- Make benchmarks generic over Engine type
Move beta_values Vec from per-iteration allocation to thread-local
buffer in SpartanThreadState. Reduces allocations from O(num_x_out) to
O(num_threads) in the scatter phase.

- Use Vec::with_capacity(num_rounds) for r and polys vectors
- Replace heap-allocated vec![...] with stack array for per-round evals
Add bind_three_polys_top helper that binds three polynomials together,
reducing Rayon dispatches from 2 to 0-1 per round and using serial
fallback for small polynomials (n < 4096) to avoid scheduling overhead.
Uses two-phase wide-limb accumulation to reduce Montgomery reductions
from O(2^k) to O(2^{k/2}) per round in
prove_cubic_with_three_inputs_small_value. Exploits split-eq
factorization E[id] = E_out[x_out] * E_in[x_in].
Extract extend_single and extend_batch4 helpers to process 4 suffix
elements together, enabling instruction-level parallelism on AArch64.
Adds prove_cubic_with_three_inputs_split_eq_delayed to measure the
effect of delayed modular reduction in eq polynomial evaluation
separately from small-value accumulator precomputation.
Introduces DelayedModularReductionMode trait with
DelayedModularReductionEnabled and DelayedModularReductionDisabled
marker types for zero-cost compile-time strategy selection. This enables
benchmarking DMR speedup without runtime branching overhead.

  Key changes:
  - Add delay_modular_reduction_mode.rs with AccumulateProduct and Mode
    traits
  - Make RoundAccumulator/LagrangeAccumulators generic over element type
  - Simplify MatVecMLE: move accumulation logic to Mode trait
  - Add accum_bench example for DMR comparison benchmarks
  - Make lagrange_sumcheck module public for external benchmarking
@wu-s-john (author) commented: @microsoft-github-policy-service agree
