Ethereum substate recorder/replayer based on the paper:
Yeonsoo Kim, Seongho Jeong, Kamil Jezek, Bernd Burgstaller, and Bernhard Scholz: An Off-The-Chain Execution Environment for Scalable Testing and Profiling of Smart Contracts, USENIX ATC'21
To build our recorder (geth record-substate
) and replayer (substate-cli
), run make record-replay
.
Running make rr
or make
works exactly same as make record-replay
.
You can find all executables including geth
and our substate-cli
in build/bin/
directory.
Running with --version
from record-replay will print both the Geth version and the record-replay version.
For example, ./geth --version
will print geth version 1.13.15-rr0.5.0-commit
which indicates rr0.5.0
is based on Geth v1.13.15
.
Check CHANGELOG.md for release notes.
Here is a simple way how to record substates.
- Sync
geth
up to the block that you want to record and replay. Geth full/snap sync will download blocks from the genesis block. To sync blocks after the PoS update, you need to simultaneously run a consensus layer client. Visit Nodes and Clients page for more details. - Export blocks using
geth export
command. For example,geth export ethereum.blockchain
to export from the genesis block to the latest synced block. - Import the exported blocks to record substates using
geth record-substate
command from scratch. For example,geth record-substate ethereum.blockchain
to import and record the exported blocks from the unmodified Geth.
If you want to record a specific range of blocks (X)-(Y)
, you need the Geth database specified by --datadir
whose head block is (X-1)
, and blocks (X)-(Y)
exported to a file.
If you don't have the Geth database at block (X-1)
, then you need to export blocks up to (X-1)
, and import it from scratch again.
The output directory specified by --substatedir
(default: substate.ethereum
) is the substate DB.
The directory is a single LevelDB instance, so you must read or write the substate DB with github.com/syndtr/goleveldb
module.
The substate DB may be corrupted if you directly write or modify any files in the directory.
Do not use LevelDB library for other languages (C++, Python, etc.) because they are incompatible with the goleveldb module.
Our recorder requires more memory to test faithful replay while writing substates to substate DB.
Therefore, it is recommended to have 32GB RAM for recording.
If you want to run without testing faithful replay, use --skip-check-replay
option.
Our geth record-substate
command is based on geth import
full sync. You may want to try different options of geth import
such as --snapshot
, --db.engine
, and --state.scheme
to improve full sync speed and size. The --datadir
path after geth record-substate
will be at the last block of the imported chain same as geth import
.
For example, if you want to record from block 2_000_001
to 3_000_000
:
# Ctrl+C to stop syncing when geth finished sync blocks to record
./geth --datadir datadir-1
# Export blockchain files 0-2M, 2-3M
./geth export --datadir datadir-1 0-2M.blockchain 1 2000000
./geth export --datadir datadir-1 2-3M.blockchain 2000001 3000000
# Import blockchain files up to 2M
./geth import --datadir datadir-2 0-2M.blockchain
# Record substates from 2M to 3M
./geth record-substate --datadir datadir-2 2-3M.blockchain
Since rr0.2
, a substate database contains two types of key-value pairs in one goleveldb instance.
The first 2 bytes of a key in a substate DB represent different data types as follows:
1s
: Substate, a key is"1s"+N+T
with transaction indexT
at blockN
.T
andN
are encoded in a big-endian 64-bit binary.1c
: EVM bytecode, a key is"1c"+codeHash
wherecodeHash
is Keccak256 hash of the bytecode.
A goleveldb instance is the path of the directory that contains *.ldb
files.
Copying or overwriting *.ldb
does not merge two instances but corrupts the written one.
The goleveldb module and the official C++ LevelDB implementation are not compatible with each other.
Therefore, you need to write a Go program to properly read and write goleveldb instances.
substate-cli replay
executes transaction substates in a given block range.
If substate-cli replay
finds an execution result that is not equivalent to the recorded result,
it returns an error immediately.
For example, if you want to replay transactions from block 1,000,001 to block 2,000,000 in substate.ethereum
:
./substate-cli replay --block-segment 1000001-2000000
In --block-segment
, you can use _
as a digit separator in block segment like 1_000_001-2_000_000
.
You can use SI unit suffix k
and M
to --block-segment
for shorter notations like 1_000-2_000k
or 1-2M
.
./substate-cli replay --block-segment 1-2M
NOTE: When use k
or M
to --block-segment
, the last digit of the first block number is 1
, not 0
, as described in the above examples.
Here are command line options for substate-cli replay
:
NAME:
substate-cli replay - replay transactions and check output consistency
USAGE:
substate-cli replay [command options] [arguments...]
CATEGORY:
replay
DESCRIPTION:
substate-cli replay executes transactions in the given block segment
and check output consistency for faithful replaying.
OPTIONS:
--block-segment value
Single block segment (e.g. 1001, 1_001, 1_001-2_000, 1-2k, 1-2M)
--skip-call-txs (default: false)
Skip executing CALL transactions to accounts with contract bytecode
--skip-create-txs (default: false)
Skip executing CREATE transactions
--skip-transfer-txs (default: false)
Skip executing transactions that only transfer ETH
--substatedir value, --substate-db value (default: "substate.ethereum")
Data directory for substate recorder/replayer
--workers value (default: 4)
Number of worker threads (goroutines), 0 for current CPU physical cores
For example, if you want 32 workers to replay transactions except CREATE transactions:
./substate-cli replay --block-segment 1-2M --workers 32 --skip-create-txs
If you want to replay only CALL transactions and skip the other types of transactions:
./substate-cli replay --block-segment 1-2M --skip-transfer-txs --skip-create-txs
If you want to use a substate DB other than substate.ethereum
(e.g. /path/to/substate_db
):
./substate-cli replay --block-segment 1-2M --substatedir /path/to/substate_db
To assess hard-forks with prior transactions, use substate-cli replay-fork
command. Run ./substate-cli replay-fork --help
for more details:
NAME:
substate-cli replay-fork - replay transactions with the given hard fork and compare results
USAGE:
substate-cli replay-fork [command options] [arguments...]
CATEGORY:
replay
DESCRIPTION:
substate-cli replay executes transactions in the given block segment
with the given hard fork config and report output comparison results.
OPTIONS:
--block-segment value
Single block segment (e.g. 1001, 1_001, 1_001-2_000, 1-2k, 1-2M)
--hard-fork value (default: 17034870)
Hard-fork block number, won't change block number in BlockEnv for NUMBER
instruction, 1: Frontier, 1150000: Homestead, 2463000: Tangerine Whistle,
2675000: Spurious Dragon, 4370000: Byzantium, 7280000: Constantinople +
Petersburg, 9069000: Istanbul, 12244000: Berlin, 12965000: London, 15537394:
Paris (The Merge), 17034870: Shanghai
--skip-call-txs (default: false)
Skip executing CALL transactions to accounts with contract bytecode
--skip-create-txs (default: false)
Skip executing CREATE transactions
--skip-transfer-txs (default: false)
Skip executing transactions that only transfer ETH
--substatedir value, --substate-db value (default: "substate.ethereum")
Data directory for substate recorder/replayer
--workers value (default: 4)
Number of worker threads (goroutines), 0 for current CPU physical cores
substate-cli db-*
commands are additional commands to directly manipulate substate DBs.
substate-cli db-rr0.3-to-rr0.4
(aliased to substate-cli db-rlp2proto
) command converts the old rr0.3 DB layout (RLP) to the rr0.4 DB layout (Protobuf).
To guarantee faithful replay after upgrading, db-rr0.3-to-rr0.4
replays the converted substates before writing them to the new substate DB.
--blockchain
option can be used to supplement tx types which are required for rr0.4 substates but missing in rr0.3 substates.
./substate-cli db-rr0.3-to-rr0.4 --old-path rr0.3.substate.ethereum --new-path rr0.4.substate.ethereum --blockchain 0-1M.blockchain --block-segment 0-1M --workers 0
If --blockchain
is not provided or substates are not found in the provided blockchain file, then substate db-rr0.3-to-rr0.4
will guess tx types based on access lists and dynamic gas fees.
./substate-cli db-rr0.3-to-rr0.4 --old-path rr0.3.substate.ethereum --new-path rr0.4.substate.ethereum --block-segment 0-1M --workers 0
substate-cli db-clone
command reads substates of a given block range and copies them in a substate DB clone.
./substate-cli db-clone --src-path srcdb --dst-path dstdb --block-segment 1-2M --workers 0
substate-cli db-compact
command runs compaction functionality of the backend of the given substate DB.
./substate-cli db-compact --substatedir substate.ethereum
substate-cli db-export
command exports substates to files.
--json
option to export base64 JSON instead of binary.
--hashed
option to replace raw bytecode to code hash when export.
For hashed substates, read through the next sections for more details.
./substate-cli db-export --substatedir substate.ethereum --out-dir substate-db-export --block-segment 1-2M --workers 0
The exported files are named after their block number and tx index.
For example, the substate file at tx index 0 at block 1,000,000 has 1000000_0
in its name.
Since rr0.4
, the substate recorder and replayer use Protobuf for serialization for better compatibility and productivity with Go and other major programming languages such as C++, Java, and Python.
Until rr0.3
, the substate recorder and replayer used Ethereum RLP encoding for serialization of substate data structures.
To use Protobuf, it is required to compile Protobuf definition to the source code of Go or other languages.
For Protobuf support in Go, install both protoc
and protoc-gen-go
as described in this link.
Executing ./protoc-substate.sh
will read substate.pb
and generate substate.pb.go
file.
Protobuf named its data structure as message.
Substate
defines the following 5 messages:
Account
: account information (nonce, balance, one of code or code hash, storage)Alloc
: mapping of account addresses andAccount
messages.BlockEnv
: information from block headers (block gas limit, number, timestamp, block hashes)TxMessage
: transaction message for executionResult
: result of transaction execution
Substate
contains the following 5 values required to replay transactions and validate results:
input_alloc
: alloc that is read during transaction executionblock_env
: block information required for transaction executiontx_message
: array with exactly 1 transactionoutput_alloc
: alloc that is generated by transaction executionresult
: execution result and receipt array with exactly 1 receipt
substate_utils.go defines helper functions to convert between data structures in Geth and Protobuf.
Contract code in Account
and initialization code in TxMessage
are defined as oneof
raw bytecode (i.e., unhashed) or Keccak256 code hash (i.e., hashed) in substate.proto.
The main purpose of converting unhashed substates to hashed substates is to reduce the size of substates by replacing lengthy bytecode with fixed-size code hash.
substate_utils.go
defines HashMap
, HashKeys
, HashedCopy
, and UnhashedCopy
helper functions for conversion between unhashed substates and hashed substates.
You may want to modify our recorder to capture more information into substates.
Our recorder tests faithful transaction replay with every substate before it writes them to substate DB.
If this test fails, the recorder immediately stops and stores substates into JSON files named after block_tx
.
All byte arrays in the substate JSON files are encoded in Base64 (unlike Geth using hex for bytes in its JSON files).
You can manually inspect what makes the replay test fail using diff
command, or programmatically by writing a script that compares the JSON files using JSON libraries.
You may instrument EVM in our replayer instead of the P2P client to speed up dynamic analysis on EVM bytecode.
In this case, modify and run substate-cli replay
which checks the EVM output with the recorded output.
If those two outputs are different in substate-cli replay
, it will print substates formatted in JSON and terminate.
In this case, use --workers=1
option for sequential transaction execution and identify the block N that causes the problem.
Then, add code that prints values for debugging or use debugging tools.
You may modify EVM specification and want to see the effects of the updates.
It also means that you don't expect that all transactions are faithfully replayed.
In this case, modify and run substate-cli replay-fork
which checks differences in the outputs and reports them.