@maru-ava
## Why this should be merged

## How this works

## How this was tested

## Need to be documented in RELEASES.md?

aaronbuchwald and others added 30 commits July 7, 2025 11:41
Adds a concurrency group to the attach static libs workflow (see the GitHub Actions docs: https://docs.github.com/en/enterprise-cloud@latest/actions/reference/workflow-syntax-for-github-actions#example-using-concurrency-and-the-default-behavior).

The attach static libs GitHub action pushes and pulls a branch, derived
from the PR branch name, to
https://github.com/ava-labs/firewood-go-ethhash throughout the workflow.
If multiple commits on the same branch concurrently push/pull that
branch, they may interfere, leading to unexpected results (pulling
different code than was pushed) or git errors.

This change ensures that the workflow started by commit 1 will be
cancelled before starting the workflow for commit 2.
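
A concurrency stanza along these lines produces that behavior; the exact group key used by the workflow is an assumption here, not copied from it:

```yaml
# Sketch only; the real workflow may key the group differently.
concurrency:
  group: attach-static-libs-${{ github.ref }}
  cancel-in-progress: true
```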

This is intended to fix an issue observed in CI on
ava-labs/firewood#1024

CI running for commit 0 succeeded
(https://github.com/ava-labs/firewood/actions/runs/16055519899/job/45308974261#step:3:8),
but the timestamps show it ran concurrently with the run for the next
commit, which failed to push
(https://github.com/ava-labs/firewood/actions/runs/16055531848/job/45308958694#step:9:66).

<img width="1349" alt="Screenshot 2025-07-07 at 11 34 22"
src="https://github.com/user-attachments/assets/1780572f-1b15-49c0-bbf3-b80d47b9a415"
/>

<img width="1346" alt="Screenshot 2025-07-07 at 11 37 49"
src="https://github.com/user-attachments/assets/a36b4ccb-d27a-4920-92f6-fc2b9b7605ea"
/>

Subsequent workflow runs failed to check out the branch:


https://github.com/ava-labs/firewood/actions/runs/16055543248/job/45309129819

<img width="1343" alt="Screenshot 2025-07-07 at 11 39 28"
src="https://github.com/user-attachments/assets/e959eaf2-a198-4267-9df4-fb0f4e141c5b"
/>

Conflicting pushes to the same branch seem to be the most likely cause,
given the overlapping timestamps and the forced pushes and pulls to the
same branch name when multiple commits can run the same workflow in
parallel.

Gauges weren't being reported via the homegrown HTTP endpoint we used
for ffi; now they are. This is particularly important since we now have
a few useful gauges in firewood.

The surrounding HTTP structure might be going away, but the underlying
`stats` call is likely to stay, so this enhancement is probably still
useful.

Also took the opportunity to reduce memory allocations here, and write
directly to strings instead of converting back and forth between
string/bytes/string with all the utf8 conversions.

Histograms are still not supported, but we don't have any, and aren't
likely to add any anytime soon.
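
As a rough illustration of the allocation-reduction idea (the helper and metric name below are invented, not the actual ffi code), gauge values can be formatted straight into a single `String`:

```rust
use std::fmt::Write;

// Hypothetical helper: render (name, value) gauge pairs directly into one
// String, with no intermediate byte buffers or UTF-8 round-trips.
fn render_gauges(gauges: &[(&str, u64)]) -> String {
    let mut out = String::with_capacity(gauges.len() * 32);
    for (name, value) in gauges {
        // write! into a String is infallible, so this unwrap cannot panic
        writeln!(out, "{name} {value}").unwrap();
    }
    out
}

fn main() {
    print!("{}", render_gauges(&[("firewood.example_gauge", 3)]));
}
```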
Upcoming changes make supporting this harder, and it's not necessary: the
size of a LinearAddress and the size of a pointer are the same, so it's
more efficient to return the actual value rather than a pointer to the
value.

Since GitHub Actions is his favorite programming language, let's keep
him as a reviewer of those tickets. Everything else should be reviewed
by the firewood team now.

This cleans up a lot of the remaining `#[allow]` attributes in the code
base. Only one remains, in
`firewood-macros/tests/compile_pass/with_attributes.rs`, because I did
not want to alter the tests of the macro.

Introduces a new type, `MaybePersistedNode`, which is a node that is
either a `SharedNode` or a `LinearAddress`. This new type will allow
writes to occur while readers are still reading this revision, using an
`ArcSwap`.

Note that this does not actually put any additional nodes in memory; it
just sets things up so they could be in memory.
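
A minimal sketch of the shape of this type, using simplified stand-ins for the real `SharedNode` and `LinearAddress`:

```rust
use std::num::NonZeroU64;
use std::sync::Arc;

struct Node; // stand-in for the real trie node
type SharedNode = Arc<Node>;
type LinearAddress = NonZeroU64; // stand-in: a non-zero file offset

/// A node that is either still in memory or already written to disk.
enum MaybePersistedNode {
    InMemory(SharedNode),
    Persisted(LinearAddress),
}
```

Under this scheme, a writer can swap a node from `InMemory` to `Persisted` while readers still hold their `SharedNode`.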

Additional changes needed for this feature to be complete are:

- [ ] BranchNode children can now be Node, LinearAddress or a
MaybePersistedNode
- [ ] When converting a `MutableProposal` to an `ImmutableProposal`,
don't actually allocate space, just create `SharedNode`s from the
`Node`s.
- [ ] Remove `NodeStoreHeader` from `NodeStore`. The header should move
into the `RevisionManager`.
- [ ] Allocate new nodes from the recently-freed nodes from an expired
revision, then from the freelist. This can be done by maintaining a
local freelist from the nodes deleted in the expiring revision and
falling back to the actual freelist. Deleted nodes that are not reused
must still be pushed onto the freelist.

Nodes on the delete list might be unpersisted. This can't actually
happen yet, but support for it is there. This is half of making branch
children unpersisted.

Reaping unpersisted nodes is easy: we just drop them. If they are
persisted, they need to move to the freelist, as they did before.

Branch children might also not be persisted.

This also splits up `hash_helper`, which got too long for clippy.

- [x] Roots may not be persisted (Part 1: #1041)
- [x] BranchNode children can now be Node, LinearAddress or a
MaybePersistedNode (Part 2: #1045 and this PR #1047)
- [ ] When converting a `MutableProposal` to an `ImmutableProposal`,
don't actually allocate space, just create `SharedNode`s from the
`Node`s.
- [ ] Remove `NodeStoreHeader` from `NodeStore`. The header should move
into the `RevisionManager`.
- [ ] Allocate new nodes from the recently-freed nodes from an expired
revision, then from the freelist. This can be done by maintaining a
local freelist from the nodes deleted in the expiring revision and
falling back to the actual freelist. Deleted nodes that are not reused
must still be pushed onto the freelist.

This PR enables Firewood to export metrics so that clients such as
Coreth can aggregate those metrics into their own.

## How
Splits the Firewood API into the following:

```go
// Start recorder
func StartMetrics()

// Start recorder and exporter
func StartMetricsWithExporter()

// Gather latest metrics
func GatherMetrics() (string, error)
```

`GatherMetrics()` is used to create a custom Prometheus gatherer that
can be used by Coreth's `MultiGatherer`.

## Testing

Extended the existing metrics test to account for `Gatherer`.

Any error from Firewood is obviously a Firewood error; the user should
be able to figure this out.

A nodestore and its related types evolved over time, becoming more
complex. This change splits the functionality into different areas to
make it easier to find specific parts of the nodestore logic.
    
Doc comments about the layout are extensive and are not repeated here.

Other than moving things around, some additional changes are:
 - `UpdateError` was dead code. It was used in early development of
   merkle's insert but is no longer used.
 - Accessors for various members of `NodeStoreHeader` instead of
   referencing them directly. This makes auditing their use a little
   easier, and improves scoping and readability.

`dump` has to work with unpersisted nodes, as there are a bunch of tests
that assume an immutable proposal can be dumped. This was useful anyway
while debugging some tests in the next commit, and small enough to
commit separately.

While looking into Taskfiles for `ffi`, I noticed that we're still using
`golangci-lint` v1 while AvalancheGo is already using `golangci-lint`
v2. This PR upgrades the version of `golangci-lint` used in `ffi` to v2.

To maintain synchrony between the Firewood and AvalancheGo linter rules,
an additional CI job (`expected-golangci-yaml-diff`) ensures that the
difference between the two `.golangci.yaml` files is equal to the
contents of `golangci_yaml_expected_changes.txt`.

## What

- Copied over the current AvalancheGo `golangci.yml` file and removed
AvalancheGo-specific rules.
- Bumped version of `golangci-lint` action used in CI.
- Added `expected-golangci-yaml-diff` CI job
- Linted files

## How This Was Tested

Ran `golangci-lint` locally + CI.

The way we used serde when serializing free areas but not other nodes
was inconsistent and caused an issue where we expected data to be
encoded one way but in actuality it was encoded another way.

This commit removes the dependency on serde for serialization in the
storage crate, specifically in the `FreeArea` serialization. This
ensures that all serialization is done manually, which is consistent and
avoids the issues we were facing with serde's encoding.
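
A sketch of the manual-encoding style; the tag byte and field layout here are illustrative, not Firewood's actual on-disk format:

```rust
/// Stand-in for the metadata stored at the head of a freed block.
struct FreeArea {
    next_free_block: Option<u64>, // 0 encodes "end of list" in this sketch
}

impl FreeArea {
    // Hand-rolled, explicit layout: one tag byte plus a little-endian u64,
    // so the bytes on disk are exactly what the code says they are.
    fn to_bytes(&self) -> [u8; 9] {
        let mut buf = [0u8; 9];
        buf[0] = 0xFF; // illustrative tag identifying a free area
        buf[1..].copy_from_slice(&self.next_free_block.unwrap_or(0).to_le_bytes());
        buf
    }

    fn from_bytes(buf: &[u8; 9]) -> Self {
        let next = u64::from_le_bytes(buf[1..].try_into().unwrap());
        FreeArea { next_free_block: (next != 0).then_some(next) }
    }
}
```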

Fixes: #831 #1057

Add functions to split leaked ranges into areas that can be added to
free lists. This concludes this diff: #1033.

This PR adds checks for stored-area misalignment. We only focus on areas
that are not leaked (areas tracked in the trie or in the free list). As
long as the non-leaked areas are aligned, leaked areas will be aligned
as well.

When a misaligned area is identified, we return not only the address of
this area but also who points to it directly. If the area is allocated
for a trie node, the root or the parent trie node is returned. If the
area is free, the free list head or the parent of this free area is
returned. We return the parent because, if we were to fix the
misalignment (by migrating to another area or removing it from the free
list), the parent pointer would need to be updated accordingly.

As a side effect, `FreeListIterator` now returns the parent of each free
list item. This design helps the checker fix various issues it sees
while traversing the free list, but adds some overhead if the iterator
is used for other purposes.
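
A sketch of that shape with stand-in types: each yielded item carries the pointer that references it, so a fixer knows exactly what to rewrite:

```rust
use std::num::NonZeroU64;

type LinearAddress = NonZeroU64; // stand-in for the real address type

/// Who points at a free area: the head slot of a free list, or the
/// previous free area in the chain.
enum FreeAreaParent {
    FreeListHead { list_index: usize },
    PrevFreeArea(LinearAddress),
}

/// What a free-list traversal yields in this sketch: the area itself plus
/// the parent pointer that must be updated if the area is repaired.
struct FreeListEntry {
    addr: LinearAddress,
    parent: FreeAreaParent,
}
```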
It's cumbersome (and more prone to TOCTOU issues) for users to validate
whether a file exists and, if it does, set a flag. This should be
handled internally. The `truncate` config now more closely matches its
actual meaning: if the file already exists, it will be wiped and
re-created.
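
A minimal sketch of the intended semantics using only the standard library (not the actual Firewood API):

```rust
use std::fs::{File, OpenOptions};
use std::io;

// A single open() call handles "create if missing, wipe if truncate is
// set", avoiding a separate exists() check that could race (TOCTOU).
fn open_db_file(path: &str, truncate: bool) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(truncate)
        .open(path)
}
```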

There are unit tests in Rust and in Go for this.

Close #1039

An `UnpersistedIterator` iterates through a `NodeStore` looking for
unpersisted nodes: `MaybePersistedNode`s that have not yet been
persisted.

Note that it's impossible for an unpersisted node to have a persisted
parent, so it's relatively easy to find the unpersisted ones.

There are tests that verify that the nodes are returned in the order
that they will need persisting (that is, depth first).
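
As a rough illustration with stand-in types, a children-first (post-order) walk yields every unpersisted node before its parent, so each node can be written before anything references its address:

```rust
struct Node {
    children: Vec<Node>,
    persisted: bool, // stand-in for "already has an on-disk address"
}

// Post-order: recurse into children first, then consider the node itself.
fn unpersisted_post_order<'a>(node: &'a Node, out: &mut Vec<&'a Node>) {
    for child in &node.children {
        unpersisted_post_order(child, out);
    }
    if !node.persisted {
        out.push(node);
    }
}
```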

---------

Signed-off-by: Ron Kuris <[email protected]>
Co-authored-by: Suyan Qu <[email protected]>
Branch nodes were already handling the overflow correctly, but leaf
nodes were not. A failing test case was added to ensure this fix works.

Previously:

```text
---- node::test::test_serialize_deserialize::leaf_node_obnoxiously_long_partial_path stdout ----

thread 'node::test::test_serialize_deserialize::leaf_node_obnoxiously_long_partial_path' panicked at storage/src/node/mod.rs:553:9:
assertion `left == right` failed
  left: Leaf([Leaf 7 4 6 8 6 9 7 3 2 0 6 9 7 3 2 0 6 1 2 0 7 2 6 5 6 1 6 c 6 c 7 9 2 0 6 c 6 f 6 e 6 7 2 0 7 0 6 1 7 2 7 4 6 9 6 1 6 c 2 0 7 0 6 1 7 4 6 8 2 c 2 0 6 c 6 9 6 b 6 5 2 0 7 3 6 f 2 0 6 c 6 f 6 e 6 7 2 0 6 9 7 4 2 7 7 3 2 0 6 d 6 f 7 2 6 5 2 0 7 4 6 8 6 1 6 e 2 0 3 6 3 3 2 0 6 e 6 9 6 2 6 2 6 c 6 5 7 3 2 0 6 c 6 f 6 e 6 7 2 0 7 7 6 8 6 9 6 3 6 8 2 0 7 4 7 2 6 9 6 7 6 7 6 5 7 2 7 3 2 0 2 3 3 1 3 0 3 5 3 6 2 e  04050607])
 right: Leaf([Leaf [invalid ca] 1 7 4 6 8 6 9 7 3 2 0 6 9 7 3 2 0 6 1 2 0 7 2 6 5 6 1 6 c 6 c 7 9 2 0 6 c 6 f 6 e 6 7 2 0 7 0 6 1 7 2 7 4 6 9 6 1 6 c 2 0 7 0 6 1 7 4 6 8 2 c 2 0  0c0609060b06])
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

fixes #1056

---------

Signed-off-by: Joachim Brandon LeBlanc <[email protected]>
This test inserts a bunch of keys and then verifies that the resulting
hashes are the same if we delete them in the reverse order. However, the
test is generating a duplicate key, which will not reproduce the
original hash set.

Changed the test to remove any duplicate keys found. It's okay if the
data is the same.

Now passes with the errant key from #1075

Also fixed the merkle dump generator, which I used to debug this. A
bracket was missing, so graphviz was failing to parse the file. Also
shortened the hashes, as they made the nodes look much bigger, and added
a newline. See the attached merkle dump for the new format.

The size of the underlying file is not reliable on certain OSes, and the
high watermark can be either larger or smaller than the actual file
size. Therefore, the checker should not compare the high watermark
against the actual file size.

While traversing the trie, we can now optionally check whether all
hashes are valid.
I noticed I was not auto-added to reviews for the ffi and .github
directories, and I discovered that this is because the wildcard rule was
being ignored if a more specific rule matched.

I have updated the `CODEOWNERS` file to ensure that Ron and I are added
to all reviews.

The checker decides whether to enable ethhash at build time through
feature flags. This change exposes that feature flag when building
`fwdctl`.
Copilot AI and others added 15 commits November 17, 2025 17:33
- [x] Successfully rebased on main (commit 03f12df)
- [x] Created METRICS.md with comprehensive documentation of all 29
metrics
- [x] Added note about one metrics instance per process
- [x] Updated README.md to reference METRICS.md
- [x] Added METRICS.md to license header exclusion list in
.github/check-license-headers.yaml
- [x] Clean single commit with only documentation changes
- [x] No unrelated PR changes visible in diff

Issue: **Add comprehensive metrics documentation to README**
> ### Problem
> There is no comprehensive document listing or describing the available
metrics in Firewood. Users and developers cannot easily understand what
metrics are available for monitoring or how to interpret them.
> 
> ### Proposed Solution
> Add a dedicated "Metrics" section to the top-level README.md that
documents all available metrics.
> 
> ### Content to Include
> - List of all available metrics with their names
> - Description of what each metric measures
> - Usage examples for enabling and gathering metrics
> - Information about metric labels and values
> - Examples of how to interpret metrics for monitoring and debugging
> 
> ### Current State
> - Metrics are mentioned briefly in `ffi/README.md`
> - The `firewood-macros/README.md` shows how metrics are instrumented
> - No comprehensive listing exists of what metrics are actually
available
> 
> ### References
> - See `storage/src/macros.rs` for metric macro implementations
> - See `ffi/README.md` for existing metrics documentation
> - See `firewood-macros/README.md` for metrics macro usage

- Fixes ava-labs/firewood#1398


---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: rkuris <[email protected]>
Co-authored-by: demosdemon <[email protected]>
Co-authored-by: Brandon LeBlanc <[email protected]>
I noticed there was one remaining component using arc-swap. This change
replaces it with a standard Mutex and removes the dependency on
arc-swap.

As discussed offline and in #1460, the `ffi-nix` job takes around 15
minutes to complete and runs on every PR. This PR changes it so that the
`ffi-nix` job runs only on commits to `main`.

This PR adds `FjallStore`, an implementation of `RootStore` that allows
for persistent storage of revisions (in contrast to `MockStore`, which
keeps everything in memory). The main changes of this PR are as follows:

- Adds `FjallStore` implementation
- Extends the `DbConfig` to take in a path to where `FjallStore` will be
stored
- Extends the FFI database config to enable archival support
- Adds `FjallStore` tests in both Rust and Golang

The previous work to defer allocating nodes until commit time had the
side effect of allowing nodes to exceed the maximum allowed size at hash
time. This defers the error until commit time, which is not ideal.

This change updates the test to confirm that the allocator still errors
when trying to allocate a node that is too large. However, we would
prefer to catch this earlier, without serializing the node multiple
times.

As part of this change, I created a helper type that the test can use to
verify the exact error was returned instead of comparing against the
string representation of the error. As a part of that, I discovered that
there were some vestiges of old code where the allocator would
potentially return an area larger than necessary when allocating from
the free list. However, that functionality was removed in #1217; thus, I
have simplified the remaining code to reflect that, which also
simplified my `giant_node` test changes. Similarly, the wasted-space
metric I removed was no longer relevant because we no longer select free
nodes larger than needed.

I added `DisplayTruncatedHex` to the `Debug` representations of the leaf
and branch nodes to truncate the node values when printing debug info.
This was necessary before fixing the `giant_node` test because the debug
output included the full contents of the node, which would flood the
terminal when the test failed and cause it to hang.

Closes: #1054

Upgrading dependencies included `[email protected]`, which pins its
dependency on `generic-array` to `0.14.7`, removing the deprecation
warnings we were seeing that were added to `generic-array` in version
`0.14.9` (version `0.14.8` was yanked).

This doesn't totally complete #1113, since I didn't touch the Proof or
Iterator files. However, it does clarify a lot of the use cases and what
is valid for concurrent operations.

The best way to review this is to open the godocs and read through them,
commenting on anything that doesn't seem clear. To reproduce the godocs,
run:
```
cd ./ffi
godoc -http=:<PORT_NUM>
```

Please leave nit-picky comments! It would be great if any reader of the
docs would know how to use the code.

Closes #1348, #1413

This PR closes #1030 by migrating from the standard library's `Mutex`
and `RwLock` implementations to the `parking_lot` equivalents throughout
the codebase.
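
The practical difference at call sites is small; a sketch against the parking_lot 0.12 API:

```rust
use parking_lot::Mutex; // std::sync::Mutex would require .lock().unwrap()

fn main() {
    let counter = Mutex::new(0u32);
    // parking_lot locks are infallible: no Result, no lock poisoning.
    *counter.lock() += 1;
    assert_eq!(*counter.lock(), 1);
}
```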
With `FjallStore` now merged in, I believe there's not a strong reason
to keep around `MockStore`. This PR removes `MockStore` from Firewood.

This PR also includes cleaning up the `FjallStore` tests by adding a
helper for constructing instances of `Db` with `FjallStore`: 9b0d009.

IMO, the `reopen()` method of `TestDb` is a misnomer: while this method
does not truncate the existing Firewood file, it uses the default
configuration for the new `TestDb` instance, ignoring the prior
instance's configuration. As a consumer, I would expect `reopen()` to
maintain the existing configuration.

This PR does the following:
- Unifies `reopen()` and `reopen_with_config()` into a single method
- Stores the `dbconfig` in `TestDb` to allow the database configuration
to persist across reopens
- Adds documentation to both `reopen()` and `replace()`
- Removes `into_root_store()` test helpers since `RootStore` reopens are
now handled via the database configuration

This PR is a precursor to #1481.

Move `RootStore` creation into the `RevisionManager` (#1482)

In preparation for moving in-memory revisions to their own data
structure (#1470), this PR moves creation of `RootStore` into the
`RevisionManager`. This move will allow us to inject the in-memory
revisions data structure into `RootStore` during creation.

This PR is a precursor to #1480.

While taking a stab at #1483, I noticed after removing `NoOpStore` in
favor of `Option<T>` that there's actually no need for the `Box`
pointer, since if we have `Some`, it'll always be an instance of
`FjallStore` (see the sketch after the list below).

This PR does the following:
- Removes the usage of `Box` pointers for `RootStore`
- Removes the `RootStore` trait and renames `FjallStore => RootStore`
- Updates code to use new `RootStore` naming and logic
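
An illustrative before/after of the field (not the exact Firewood definitions):

```rust
struct RootStore; // formerly FjallStore, now the only implementation

struct RevisionManager {
    // before: root_store: Option<Box<dyn RootStoreTrait>>
    // after: the concrete type is statically known, so no Box is needed
    root_store: Option<RootStore>,
}
```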
@maru-ava maru-ava self-assigned this Nov 25, 2025
@alarso16 alarso16 force-pushed the alarso16/migrate-coreth-6-lint branch from 93843f3 to 72eae69 Compare November 25, 2025 19:14
Base automatically changed from alarso16/migrate-coreth-6-lint to maru/migrate-coreth-subtree-merge November 25, 2025 19:15
Base automatically changed from maru/migrate-coreth-subtree-merge to master November 25, 2025 19:45
@github-actions github-actions bot removed the monorepo label Nov 25, 2025
#1488 found a panic during serialization of a node, indicating that a
child node was either not hashed or not allocated, and therefore the
parent node could not be serialized. After a walk-through of the code, I
noticed that after #1389 we began serializing nodes in batches, in order
to prevent corrupting the free list after a failed write, and this
introduced a case where the panic could occur. Previously, we would
serialize a node, write it to storage, and then mark it as allocated in
one step. Now we serialize and allocate the node and add it to a batch
before moving on to the next node. Like before, we only mark the node as
allocated after writing it to disk. This means that if the batch
contained both a child and its parent, the parent node would fail to
serialize because it could not reference the child's address.

This change refactors the allocation logic to mark nodes as allocated
during batching. This ensures that when a node's parent is also in the
batch, the parent can successfully serialize. The linear address itself
is not read by viewers of the `Allocated` state, because the node is not
considered committed until the entire write has completed (i.e., after
the root and header are written). This means the change does not
introduce any early reads of uncommitted data.
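
A sketch of the state progression with illustrative names; the key point is that `Allocated` carries an address that other nodes in the batch can serialize against, while viewers only trust addresses after commit:

```rust
use std::num::NonZeroU64;

type LinearAddress = NonZeroU64; // stand-in address type

enum NodeState {
    /// No address assigned yet.
    Unallocated,
    /// Address assigned while the batch is built, so a parent in the same
    /// batch can serialize a reference to it; not yet visible to viewers.
    Allocated(LinearAddress),
    /// Root and header written; the address may now be read.
    Committed(LinearAddress),
}
```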

In the course of writing tests, I discovered that the arena allocator
returns the number of bytes reserved by the arena, not the number of
bytes occupied by items within the arena. This meant that the batching
logic would have always generated a batch of one element. To fix this, I
track the allocated length of each area. This successfully triggered the
bug I described; however, it also means the bug might not be the source
of the original panic.
@maru-ava maru-ava linked an issue Dec 1, 2025 that may be closed by this pull request
demosdemon and others added 4 commits December 1, 2025 10:05
The upstream `.golangci.yaml` was updated to relocate the linters block.
This commit updates our `.golangci.yaml` to do the same and updates the
patch file to reflect this change.

Upstream: <#4535>
git-subtree-dir: firewood
git-subtree-mainline: 136dba8
git-subtree-split: dd9cb9f

Successfully merging this pull request may close these issues:

- Spike on Bazel as a precondition for Firewood migration