@klauspost commented Jan 3, 2026

Only tested on QEMU.

See perf on real HW below.

Copilot AI left a comment

Pull request overview

This PR adds an ARM64 assembly implementation of decompression to the minlz package. The main goal is to provide optimized decompression on ARM64, matching the existing AMD64 assembly support. The PR also adds test coverage for the new decoder and updates the CI configuration to test on ARM64 runners.

Key changes:

  • New ARM64 assembly decoder with fast and slow loops, NEON SIMD optimizations, and special handling for overlapping copies
  • Comprehensive test suite with edge cases for overlapping copies, long offsets, and various data patterns
  • CI updates to include ARM64 testing and newer toolchain versions
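
To illustrate why overlapping copies need special handling in an LZ-style decoder, here is a minimal Go sketch. It is not the code from this PR: `copyMatch` is a hypothetical helper, and it shows only the generic technique of copying byte by byte when the match offset is smaller than the match length, so that bytes written earlier in the copy become the source for later ones.

```go
package main

import (
	"errors"
	"fmt"
)

// copyMatch is a hypothetical helper (not part of minlz). It appends
// length bytes to dst, reading from offset bytes before the current end.
func copyMatch(dst []byte, offset, length int) ([]byte, error) {
	if length < 0 || offset <= 0 || offset > len(dst) {
		return dst, errors.New("invalid match offset or length")
	}
	start := len(dst) - offset
	if offset >= length {
		// Source and destination do not overlap: a bulk copy is safe.
		return append(dst, dst[start:start+length]...), nil
	}
	// Overlapping copy: each appended byte may be read again later in
	// the same copy, which repeats the last offset bytes as a pattern.
	for i := 0; i < length; i++ {
		dst = append(dst, dst[start+i])
	}
	return dst, nil
}

func main() {
	out, err := copyMatch([]byte("ab"), 2, 6)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

A wide SIMD load and store cannot be used directly in the overlapping case, because the load would read bytes that the same copy has not written yet. That is the general reason a decoder treats short offsets separately.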

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 18 comments.

Show a summary per file

| File | Description |
| --- | --- |
| decode_test.go | Adds comprehensive tests for ARM64 decoder covering edge cases, overlapping copies, random data, and regression testing |
| decode_other.go | Updates build tags to exclude ARM64 from the Go fallback implementation |
| decode_arm64.go | ARM64-specific wrapper that calls the assembly implementation with race detection support |
| asm_arm64.go | Go stub declaration for the ARM64 assembly decoder function |
| asm_arm64.s | Complete ARM64 assembly implementation with ~986 lines covering all decompression tag types and copy operations |
| .github/workflows/release.yml | Updates Go version to 1.25.x and goreleaser to 2.13.2 |
| .github/workflows/go.yml | Adds ARM64 runner, updates Go versions to 1.25.x, updates goreleaser, and expands fuzz test matrix |

@RaduBerinde

This is great! I will do some testing with CockroachDB.

@RaduBerinde

Also, I think GitHub Actions supports ARM runners: https://github.com/orgs/community/discussions/148648

@klauspost
Collaborator Author

@RaduBerinde I would definitely be happy if you could verify that it is at least on par with the Go code.

Probably the easiest would be to build the mz tool.

For example, using this file as a test set: 7mb-cockroach-db.log.mzb.gz.

λ go build -tags=noasm&&./mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.907s, 5044.8MB/s

Decompressing block (16 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.969s, 29949.9MB/s (5.9x)

λ go build -tags=&&./mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.883s, 8231.6MB/s

Decompressing block (16 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.845s, 30352.4MB/s (3.7x)

(Here the multithreaded run just hits memory bandwidth, but the single-thread numbers should be fine for comparison.)

@RaduBerinde

This is on an Apple M1 laptop:

❯ go run -tags=noasm ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.885s, 4160.9MB/s

Decompressing block (10 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.901s, 32239.8MB/s (7.7x)

❯ go run ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.878s, 4673.1MB/s

Decompressing block (10 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.972s, 35766.4MB/s (7.7x)

I will also check on a GCE arm machine.

@klauspost
Collaborator Author

👍🏼 Small improvement. I will have some ARM hardware available for further investigation - I can test on a wider data set.

Mostly just making sure it wasn't a regression.

@klauspost changed the title from "exp: Add arm64 decompression assembly" to "perf: Add arm64 decompression assembly" on Jan 7, 2026
@RaduBerinde

On T2A (older):

ubuntu@radu-minlz-t2a-standard-4-0001:~/minlz$ go run -tags=noasm ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
go: downloading github.com/klauspost/compress v1.17.11
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.941s, 2618.3MB/s

Decompressing block (4 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.849s, 10043.5MB/s (3.8x)
ubuntu@radu-minlz-t2a-standard-4-0001:~/minlz$ go run ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.913s, 2947.0MB/s

Decompressing block (4 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.834s, 11338.6MB/s (3.8x)

On C4A (newer):

ubuntu@radu-minlz-c4a-0001:~/minlz$ go run -tags=noasm ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
go: downloading github.com/klauspost/compress v1.17.11
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.913s, 4054.3MB/s

Decompressing block (4 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.839s, 15934.1MB/s (3.9x)
ubuntu@radu-minlz-c4a-0001:~/minlz$ go run ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.881s, 4955.4MB/s

Decompressing block (4 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.84s, 19486.9MB/s (3.9x)

For comparison, this is what a recent x86 (C4) looks like:

ubuntu@radu-tpcc2-0001:~/minlz$ go run -tags=noasm ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
go: downloading github.com/klauspost/compress v1.17.11
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.929s, 3959.6MB/s

Decompressing block (32 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.871s, 80850.6MB/s (20.4x)
ubuntu@radu-tpcc2-0001:~/minlz$ go run ./cmd/mz d -bench=10 -block 7mb-cockroach-db.log.mzb
Reading 7mb-cockroach-db.log.mzb...

Decompressing Block using 1 thread...
 * 568610 -> 8247494 bytes [1450.47%]; 9.881s, 5833.4MB/s

Decompressing block (32 threads)...
 * 568610 -> 8247494 bytes [6.89%]; 9.913s, 112799.6MB/s (19.3x)
