Skip to content

Conversation

@katyhr
Copy link
Collaborator

@katyhr katyhr commented Aug 20, 2025

We formalise the following definitions, statements and needed preliminaries thereof from [BCIKS20] refers to the paper "Proximity Gaps for Reed-Solomon Codes":

In ProximityGap.lean

  • Definition 1.1 (Proximity gaps);
  • Statement of Theorem 1.2 (Proximity gaps for Reed-Solomon codes);
  • Statements of the Correlated Agreement Theorems: Theorem 1.4 (Correlated agreement over affine lines), Theorem 1.5 (Correlated agreement over low-degree parameterised curves), and Theorem 1.6 (Correlated agreement over affine spaces).

In DivergenceOfSets.lean:

  • Statement of Corollary 1.3 (Concentration bounds);

Additionally, there are linter error fixes.


Note

Introduce proximity gap framework with RS theorems (statements), divergence of sets, and supporting affine/curve, Reed–Solomon, and probability notation utilities, plus minor refactors/notation tweaks.

  • Proximity gaps and correlated agreement (new/expanded):
    • Add ProximityGap framework: proximityMeasure, proximityGap, correlatedAgreement (moved/namespaced), generalProximityGap, and errorBound for RS.
    • State RS results: proximity_gap_RSCodes, correlatedAgreement_lines, correlatedAgreement_affine_curves, correlatedAgreement_affine_spaces.
  • Divergence of sets:
    • Define possibleDeltas and divergence (now ℚ≥0); add concentration_bounds corollary statement.
  • Preliminaries and geometry utilities:
    • Matrix helpers (neqCols, row/col spans/ranks, submatrix extractors) and rank lemmas (stated).
    • Affine utilities (lines, spans, translations, finiteness/nonemptiness), and Curve utilities (parametrisedCurve and finite variants).
  • Reed–Solomon helpers:
    • ReedSolomonCode: RScodeSet, toFinset, sqrtRate.
  • Probability notation:
    • Improve Pr_{...}[...] macro expansion and examples.
  • Interleaved/Johnson bound adjustments:
    • Tweak statements to use Code.minDist/RScodeSet; minor doc cleanups.
  • Core coding basics:
    • Make ρ notation scoped macro for LinearCode.rate.
  • Plumbing:
    • Update references/usages (Combine.combine_theorem, blueprint/lean_decls) to new namespaces/defs.

Written by Cursor Bugbot for commit ab20fe0. This will update automatically on new commits. Configure here.

@alexanderlhicks
Copy link
Collaborator

Haven't reviewed this at all yet, just checking in with the review bot

@alexanderlhicks
Copy link
Collaborator

/review

External:
https://eprint.iacr.org/2020/654.pdf
https://cse.buffalo.edu/faculty/atri/courses/coding-theory/book/web-coding-book.pdf

Internal:
ArkLib/Data/CodingTheory/
ArkLib/ProofSystem/Fri

Comments:
Please review and identify any possible issues or possible improvements.

@github-actions
Copy link

🤖 AI Review (with external context)\n\nGenerating AI review...

Hello! As an expert code reviewer, I've analyzed the pull request. Here is my review:

High-level summary

This pull request introduces significant new formalizations related to proximity gaps for Reed-Solomon codes, drawing from the Ben-Sasson et al. paper "Proximity Gaps for Reed–Solomon Codes". This includes definitions for divergence of sets, correlated agreement, and proximity gaps, along with statements of key theorems like the concentration bounds.

Additionally, the PR includes refactoring and minor fixes in the Berlekamp-Welch decoder implementation, making the proofs cleaner and more explicit. A new notation ρ for the rate of a linear code is also introduced, improving readability.

While the additions are substantial and valuable, I've identified two major issues in the new definitions and a couple of potential off-by-one errors that need to be addressed. The rest of the changes are positive improvements.

Major Issues

1. Incorrect definition of Affine.line

In ArkLib/Data/CodingTheory/Prelims.lean, Affine.line is defined as a linear subspace, which contradicts its name and its intended use based on the reference paper. An affine line should represent a set of points {p + t • v | t ∈ F}, not a linear subspace.

File: ArkLib/Data/CodingTheory/Prelims.lean
Current definition:

/-- Affine line between two vectors with coefficients in a semiring.
-/
def Affine.line {F : Type*} {ι : Type*} [Ring F] (u v : ι → F) : Submodule F (ι → F) :=
  vectorSpan _ {u, v}

This defines the linear span of u and v, which is a 2D-subspace (a plane) if they are linearly independent.

Suggestion:

The definition should be changed to represent an affine line. The reference paper (Ben-Sasson et al.) describes an affine line as {u₀ + z • u₁ | z ∈ F}. In Mathlib, this corresponds to an AffineSubspace. I recommend using Mathlib's line_through or defining it appropriately. For example, if u is a point and v is a direction vector:

import Mathlib.LinearAlgebra.AffineSpace.AffineSubspace

-- ...

/-- Affine line starting at point `u` with direction `v`. -/
def Affine.line [AddTorsor (Submodule F (ι → F)) (ι → F)] (u v : ι → F) : AffineSubspace F (ι → F) :=
  line[F, u, u + v]

Or, to match the paper's {u₀ + z • u₁}, if u is u₀ and v is u₁:

/-- Affine line starting at point `u` with direction `v`. -/
def Affine.line [AddTorsor (Submodule F (ι → F)) (ι → F)] (u v : ι → F) : AffineSubspace F (ι → F) :=
  u +ᵥ Submodule.span F {v}

This change will be required for the correct formalization of theorems in InterleavedCode.lean and ProximityGap.lean.

2. Incorrect definition of correlatedAgreement

The definition of correlatedAgreement in ArkLib/Data/CodingTheory/ProximityGap.lean appears to have the subset inclusion reversed, which changes the meaning of the property significantly.

File: ArkLib/Data/CodingTheory/ProximityGap.lean
Current definition:

def correlatedAgreement (C : Set (n → F)) (δ : ℝ≥0) {k : ℕ} (W : Fin k → n → F) : Prop :=
  ∃ S : Finset n, #(S) ≥ (1 - δ) * (Fintype.card n) ∧
    ∃ v : Fin k → n → F, ∀ i, v i ∈ C ∧ {j | v i j = W i j} ⊆ S

The paper states "for all i, vi agrees with ui on all of D'". This means the agreement set should be a superset of the large set S. The current definition states that the agreement set is a subset of S, which means agreement can only happen within S.

Suggestion:

The subset inclusion should be reversed.

def correlatedAgreement (C : Set (n → F)) (δ : ℝ≥0) {k : ℕ} (W : Fin k → n → F) : Prop :=
  ∃ S : Finset n, #(S) ≥ (1 - δ) * (Fintype.card n) ∧
    ∃ v : Fin k → n → F, ∀ i, v i ∈ C ∧ S ⊆ {j | v i j = W i j}

The diff shows a version with = instead of , which would also be correct and perhaps even clearer:

def correlatedAgreement (C : Set (n → F)) (δ : ℝ≥0) {k : ℕ} (W : Fin k → n → F) : Prop :=
  ∃ S : Finset n, #(S) ≥ (1 - δ) * (Fintype.card n) ∧
    ∃ v : Fin k → n → F, ∀ i, v i ∈ C ∧ {j | v i j = W i j} = S

Potential Issues & Questions

  1. Possible Off-by-One in errorBound:
    In ArkLib/Data/CodingTheory/ProximityGap.lean, errorBound for the Johnson bound regime uses deg ^ 2. The reference paper (Theorem 1.2) uses (k + 1)^2, where k is the polynomial degree (so dimension is k+1). In our ReedSolomon.code definition, deg is the dimension. If deg in errorBound is meant to be the dimension, then deg ^ 2 is correct. If it's meant to be the polynomial degree, it should be (deg + 1) ^ 2. Could you please clarify the intended meaning of deg here?

  2. Possible Off-by-One in correlatedAgreement_affine_curves:
    In ArkLib/Data/CodingTheory/ProximityGap.lean, the error bound for correlatedAgreement_affine_curves is k * errorBound, where u : Fin k → ι → F. The paper (Theorem 1.5) specifies l * ǫ for a curve of degree l generated by l+1 vectors {u₀, ..., uₗ}. If k in the code is the number of vectors (l+1), then the multiplier should be l, which would be k-1. This seems like a potential off-by-one error.

Minor Suggestions & Nitpicks

  • In ArkLib/Data/CodingTheory/Basic.lean, the new notation ρ for rate is a great addition for readability.
  • The refactoring in the BerlekampWelch files improves clarity. Adding the hp : p ≠ 0 hypothesis to Q'_div_E'_eq_p is a good clarification.
  • The new file ArkLib/Data/CodingTheory/DivergenceOfSets.lean is a good start to formalizing the concepts from the paper. Changing divergence to return ℚ≥0 is correct.

Overall, this is a strong contribution. Addressing the major issues will significantly improve the correctness of the new formalizations. I'm looking forward to seeing the sorrys filled in

@alexanderlhicks alexanderlhicks self-assigned this Sep 1, 2025
@Julek Julek force-pushed the Katy/ProximityWIP branch 3 times, most recently from 10e6aa2 to 720ce0f Compare September 9, 2025 18:11
@quangvdao
Copy link
Collaborator

/review

External:
https://eprint.iacr.org/2020/654.pdf
https://cse.buffalo.edu/faculty/atri/courses/coding-theory/book/web-coding-book.pdf

Internal:
ArkLib/Data/CodingTheory/
ArkLib/ProofSystem/Fri

Comments:
Please review and identify any possible issues or possible improvements. Keep in mind the previous AI review, see if the changes have been applied.

@github-actions
Copy link

github-actions bot commented Sep 9, 2025

🤖 AI Review (with external context)\n\nGenerating AI review...

Hello! As an expert code reviewer, I've analyzed the provided pull request. Here is my review:

High-level summary

This pull request introduces significant formalizations of key concepts and theorems from the paper "Proximity Gaps for Reed–Solomon Codes" [BCIKS20]. The author has done a great job laying the groundwork for proving these advanced results in Lean.

Key additions include:

  • Formal definitions for proximity gaps, divergence of sets, correlated agreement, and affine curves, directly following the reference paper.
  • Statements of several major theorems from the paper, including Corollary 1.3 (Concentration Bounds), Theorem 1.4 (Correlated Agreement over Lines), Theorem 1.5 (Correlated Agreement over Curves), and Theorem 1.6 (Correlated Agreement over Affine Spaces).
  • A new, robust probability notation Pr_{...}[...] for expressing probabilistic statements.
  • Refactoring of existing definitions in coding theory for better clarity and consistency, such as using Code.minDist instead of hardcoded distance formulas for Reed-Solomon codes.
  • A stylistic improvement for the notation of a code's rate using syntax and macro_rules for ρ.

Overall, this is a high-quality contribution that moves the project forward significantly. I've identified a few areas for improvement, mainly concerning correctness of formalization against the reference document and some minor code quality issues.

Formalization Correctness

I've compared the formalizations against the reference document "[BCIKS20] Proximity Gaps for Reed–Solomon Codes". While most definitions and theorems are captured correctly, there are a couple of discrepancies that should be addressed.

  1. correlatedAgreement_affine_curves (Theorem 1.5): Off-by-one in probability bound

    In ArkLib/Data/CodingTheory/ProximityGap.lean, the formalization of Theorem 1.5 seems to have an off-by-one error in the probability bound.

    • Paper (Thm 1.5): A curve of degree l is generated by l+1 vectors {u₀, ..., uₗ}. The probability bound is l * ε.
    • Lean code: The function takes u : Fin k → ι → F, which represents k vectors {u₀, ..., uₖ₋₁}. This defines a curve of degree k-1. Therefore, l in the paper corresponds to k-1 in the code. The probability bound should be (k-1) * errorBound, but it is currently k * errorBound.

    Suggestion:
    Modify the hypothesis hproximity as follows:

    --- a/ArkLib/Data/CodingTheory/ProximityGap.lean
    +++ b/ArkLib/Data/CodingTheory/ProximityGap.lean
    @@ -100,7 +100,7 @@
       (hδ : δ ≤ 1 - ReedSolomonCode.sqrtRate deg domain)
       (hproximity :
         Pr_{let y ← $ᵖ (Curve.parametrisedCurveFinite u)}[
           Code.relHammingDistToCode y.1 (ReedSolomon.code domain deg) ≤ δ
  •    ] >
    
  •      k * (errorBound δ deg domain)) :
    
  •    ] > (k-1) * (errorBound δ deg domain)) :
     correlatedAgreement (ReedSolomon.code domain deg) δ u := by sorry
    
    
    
  1. correlatedAgreement_affine_spaces (Theorem 1.6): Mismatch in affine space definition

    The definition of an affine subspace in the formalization of Theorem 1.6 does not seem to match the paper.

    • Paper (Thm 1.6): An affine subspace is defined as U = u₀ + span{u₁, ..., uₗ}.
    • Lean code: The code uses affineSpan F (toSet (finsetOfVectors u)). For u = (u₀, ..., uₗ), this corresponds to the set of all affine combinations ∑ aᵢuᵢ where ∑ aᵢ = 1. This is equivalent to u₀ + span{u₁ - u₀, ..., uₗ - u₀}, which is different.

    Suggestion:
    The formalization should be updated to directly represent u₀ + span{u₁, ..., uₗ} to align with the paper. This might require adjusting the function signature or the way the affine subspace is constructed. A direct construction would be clearer. For example, one could define a function that takes u₀ and a list of vectors for the span.

  2. errorBound definition: Interval endpoints

    In ArkLib/Data/CodingTheory/ProximityGap.lean, the definition of errorBound for the unique decoding regime uses the interval (0, (1-ρ)/2]. The paper [BCIKS20] specifies the interval [0, (1-ρ)/2). While this might be a minor point, it's worth double-checking if this discrepancy is intentional and correct. The definitions of real number intervals can be subtle.

Specific Feedback and Suggestions

  1. sorrys in ListDecodability.lean

    In ArkLib/Data/CodingTheory/ListDecodability.lean, the proofs for listOfCloseCodewords_eq_zero and listOfCloseCodewordsRel_eq_zero have been replaced with sorry. The original proof sketch simp [listOfCloseCodewords, Nat.card_eq_zero] seems correct. This regression was likely caused by a change in mathlib. These proofs should be restored.

    -- In ArkLib/Data/CodingTheory/ListDecodability.lean
    
    lemma listOfCloseCodewords_eq_zero :
      listOfCloseCodewords C y n = 0 ↔ IsEmpty (hammingBall C y n) ∨ Infinite (hammingBall C y n) := by
      simp [listOfCloseCodewords, Nat.card_eq_zero] -- this was previously a valid proof?
      sorry -- This should be fixed. The proof likely needs a small adjustment.
    
    lemma listOfCloseCodewordsRel_eq_zero :
      listOfCloseCodewordsRel C y r = 0 ↔
      IsEmpty (relHammingBall C y r) ∨ Infinite (relHammingBall C y r) := by
      simp [listOfCloseCodewordsRel, Nat.card_eq_zero] -- this was previously a valid proof?
      sorry -- This should also be fixed.

    Suggestion: Please investigate why the simp tactic is failing and restore these proofs. It's important to avoid introducing new sorrys.

  2. Unused Definitions in ProximityGap.lean

    The file ArkLib/Data/CodingTheory/ProximityGap.lean defines proximityMeasure and proximityGap, but they don't seem to be used in the formalization of the theorems from [BCIKS20]. Their names are quite general. If they are intended for another purpose (e.g., formalizing results from the Ligero paper as hinted by changes in InterleavedCode.lean), it would be helpful to:

    • Add docstrings explaining their origin and purpose.
    • Consider moving them to a more specific file (like InterleavedCode.lean) if they are only used in that context.
  3. Notation for Rate ρ

    The new notation for rate ρ in ArkLib/Data/CodingTheory/Basic.lean using syntax and macro_rules is a great improvement over notation. It's more robust and scoped appropriately. This is a good example of idiomatic Lean 4.

    syntax &"ρ" term : term
    
    macro_rules
       | `(ρ $t:term) => `(LinearCode.rate $t)

I'm impressed with the progress and the quality of the formalization. Addressing these points will make this contribution even stronger. Keep up the excellent work

@quangvdao
Copy link
Collaborator

@katyhr can you fix these errors? I think they are indeed errors.

@katyhr katyhr marked this pull request as draft September 10, 2025 00:14
@katyhr
Copy link
Collaborator Author

katyhr commented Sep 10, 2025

@quangvdao fixed Theorem 1.5 and amended k to k + 1 and k - 1 to k; fixed the broken proofs in ListDecodability.lean. I don't agree with the errorBound endpoints comment. Our version is consistent with the Proximity Gaps paper, so I haven't changed anything there.

@Julek fixed the correlatedAgreement_affine_spaces but is doing some extra tidying up to reduce the instances we used.

refactor Q to Q>=0, fix proofs

relHamming dist proofs

minor cleanup

more cleanups

final minor cleanups

ProximityGaps draft1

some stuff

proximity wip

line 257

WIP - thing3 + nonempty explicitly given

fix decidable equality

def generalProximityGap

fix generalProximityGap

or something like this

latest WIP

proximityGaps results

cleanup proximity, reedsolomon, divergence and prelims

fix the variable name

minor edits, added authors in ProximityGap and linter error fixes

filled in a small sorry

fix of linter errors + other small fixes

more linter fixes

fixed order of imports

Review of ProximityGap PR

minor cleanup of comments

Co-authored-by: Frantisek Silvasi <[email protected]>
Co-authored-by: Julian Sutherland <[email protected]>
Co-authored-by: Katerina Hristova <[email protected]>
@Julek Julek marked this pull request as ready for review September 26, 2025 23:55
Co-authored-by: Katerina Hristova <[email protected]>
Co-authored-by: Julian Sutherland <[email protected]>
cursor[bot]

This comment was marked as outdated.

@quangvdao quangvdao merged commit e0b4ef1 into Verified-zkEVM:main Oct 4, 2025
4 checks passed
@github-actions
Copy link

github-actions bot commented Oct 4, 2025

🤖 Gemini PR Summary

This diff introduces a significant formalization of advanced coding theory concepts, particularly proximity gaps for Reed-Solomon codes, while also performing widespread code cleanup and refinement.

Here are the key changes:

  • Formalization of Proximity Gaps:

    • New files (ProximityGap.lean, DivergenceOfSets.lean) are expanded to define concepts like generalProximityGap, errorBound, and correlatedAgreement.
    • Several major theorems from the research paper "[BCIKS20] Proximity Gaps for Reed-Solomon Codes" are stated, laying the groundwork for their formal proofs.
  • Expansion of Foundational Math Libraries:

    • To support the new theorems, the Prelims.lean file is significantly enhanced with a rich API for working with affine subspaces and parameterized curves over finite fields.
  • Improved Code Rate Notation:

    • The notation for code rate (ρ) in Basic.lean is updated to a scoped macro (&"ρ"). This makes the notation non-reserved, preventing potential name clashes with variables named ρ.
  • General Code Cleanup and Refinement:

    • Docstrings across numerous files have been standardized and improved for clarity.
    • Code is made more abstract and maintainable; for example, a hardcoded formula for minimum distance in InterleavedCode.lean is replaced with a call to the minDist property.
    • Documentation for the custom probability notation (Pr_{...}[...]) is expanded with better explanations and examples.

Analysis of Changes

Metric Count
📝 Files Changed 12
Lines Added 412
Lines Removed 149

sorry Tracking

  • Added: 5 sorry(s)
    • theorem correlatedAgreement_affine_curves [DecidableEq ι] {k : ℕ} {u : Fin k → ι → F} in ArkLib/Data/CodingTheory/ProximityGap.lean
    • theorem correlatedAgreement_affine_spaces {k : ℕ} [NeZero k] {u : Fin (k + 1) → ι → F} in ArkLib/Data/CodingTheory/ProximityGap.lean
    • lemma concentration_bounds {deg : ℕ} {domain : ι ↪ F} in ArkLib/Data/CodingTheory/DivergenceOfSets.lean
    • theorem correlatedAgreement_lines {u : Fin 2 → ι → F} {deg : ℕ} {domain : ι ↪ F} {δ : ℝ≥0} in ArkLib/Data/CodingTheory/ProximityGap.lean
    • theorem proximity_gap_RSCodes {k t : ℕ} [NeZero k] [NeZero t] {deg : ℕ} {domain : ι ↪ F} in ArkLib/Data/CodingTheory/ProximityGap.lean
  • ✏️ Affected: 2 sorry(s) (line number changed)

Last updated: 2025-10-04 21:53 UTC. See the main CI run for build status.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to the general ToMathLib folder to make it easier to keep track of?

{p : Polynomial R} {f g : ℕ → R → S}
(h : ∀ i x, f i x = g i x)
: p.sum f = p.sum g := by
aesop (add simp [Polynomial.sum])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could replace this with simp_all only [Polynomial.sum] which is less likely to break if there are changes to aesop?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is pretty optional, I realise there are many other aseop calls that could similarly be refactored, not sure if we have a consensus preference about this?)

/-- `Corollary 1.3` (Concentration bounds) from [BCIKS20]. -/
lemma concentration_bounds [Fintype F] [Field F] [Fintype ι] {deg : ℕ} {domain : ι ↪ F}
{U : AffineSubspace F (ι → F)} [Nonempty U]
(hdiv : (divergence U (RScodeSet domain deg) : ℝ≥0) ≤ 1 - ReedSolomonCode.sqrtRate deg domain)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a small deviation from the paper here, as it uses δ* ∈ (0, 1 − √ρ) i.e. strict inequality.

{F : Type*}

/-- `Definition 1.1` in [BCIKS20]. -/
noncomputable def generalProximityGap {α : Type*} [DecidableEq α] [Nonempty α]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the paper's wording

Definition 1.1 (Proximity gap). Let P ⊂ Σn be a property and C ⊂ 2Σn
be a collection of sets.
Let ∆ be a distance measure on Σn. We say that C displays a (δ, ε)-proximity gap with respect to P
under ∆ if every S ∈ C satisfies exactly one of the following:

  1. Pr s∈S[∆(s, P) ≤ δ] = 1.
  2. Pr s∈S[∆(s, P) ≤ δ] ≤ ε.
    We call δ the proximity parameter and ε is the error parameter. By default, ∆ denotes the relative
    Hamming distance measure.

Perhaps this should include some constraint like ε < 1 to avoid the case where ε =1 and both conditions are satisfied, or is it basically fine for both to be satisfied? (To some extent this is implicit in the paper as it only considers scenarios where ε is much smaller than 1.)

{k : ℕ}

/-- Affine line between two vectors with coefficients in a semiring. -/
def line [Ring F] (u v : ι → F) : Submodule F (ι → F) :=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in #143 this is redefined as

def Affine.line {F : Type*} {ι : Type*} [Semiring F] (u v : ι → F) : Set (ι → F) :=
  {x | ∃ a : F, x = u + a • v}

vectorSpan returns a submodule, the above returns a set, and it seems to be me we should also consider affineSpan which seems closer to what we want as it uses affine subspaces (which can be obtained from Submodules too) which show up in Sections 6 and 7 of the proximity gaps paper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants