A Tutorial of RocksDB SST formats

What is SST (Static Sorted Table)

All RocksDB's persistent data is stored in a collection of SSTs. We often use sst, table and sst file interchangeably.

Choosing a table format

Rocksdb supports different types of SST formats, but how to choose the table format that fits your need best?

Right now we have two types of tables: "plain table" and "block based table".

Block-based table

This is the default table type that we inherited from LevelDB, which was designed for storing data in hard disk or flash device.

In block-based table, data is chucked into (almost) fix-sized blocks (default block size is 4k). Each block, in turn, keeps a bunch of entries.

When storing data, we can compress and/or encode data efficiently within a block, which often resulted in a much smaller data size compared with the raw data size.

As for the record retrieval, we'll first locate the block where target record may reside, then read the block to memory, and finally search that record within the block. Of course, to avoid frequent reads of the same block, we introduced the block cache to keep the loaded blocks in the memory.

For more information about block-based table, please read this wiki: Rocksdb BlockBasedTable Format.

Plain table

Block-based table is proven to be efficient when store data in hard disk or flash device. However, for applications that requires low-latency in-memory database, a better alternative emerges: plain table.

Plain table, as its name suggests, stores data in a sequence of key/value pairs. But several features make plain table have not-so-plain (read "excellent") performance when serving as the module of in-memory database:

No memory copy needed. As part of in-memory database, we can easily mmap a plain table and allows direct access to its data without copying. Also plain table bypasses the concept of "block" and therefore avoids the overhead inherent in block-based table, like extra block lookup, block cache, etc.
Faster Hash-based index. Compared with block-based table, which employs mostly binary search for entry lookup, the well designed hash-based index in plain table enables us to locate data magnitudes faster.

Of course, currently there're some limitations for this plain table format (more details please see the link provide below):

File size may not be greater than 2^31 - 1 (i.e., 2147483647) bytes.
Data compression/Delta encoding is not supported, which may resulted in bigger file size compared with block-based table.
Backward (Iterator.Prev()) scan is not supported.
Non-prefix-based Seek() is not supported
Table loading is slower since indexes are built on the fly by 2-pass table scanning.
Only support mmap mode.

For more information about plain table, please read this wiki: PlainTable Format.

Comparison of SSTs

TBD

Examples

Block-based table

By default, a database uses block-based table.

#include "rocksdb/db.h"
rocksdb::DB* db;
// Get a db with block-based table without any change.
rocksdb::DB::Open(rocksdb::Options(), "/tmp/testdb", &db);

For a more customized block-based table:

#include "rocksdb/db.h"
// rocksdb/table.h includes all supported tables.
#include "rocksdb/table.h"

rocksdb::DB* db;
rocksdb::Options options;
options.table_factory.reset(NewBlockBasedTableFactory());
options.block_size = 4096; /* block size for the block-based table */
rocksdb::DB::Open(options, "/tmp/testdb", &db);

Plain table

For plain table, the process is similar:

#include "rocksdb/db.h"
// rocksdb/table.h includes all supported tables.
#include "rocksdb/table.h"

rocksdb::DB* db;
rocksdb::Options options;
// To enjoy the benefits provided by plain table, you have to enable
// allow_mmap_reads for plain table.
options.allow_mmap_reads = true;
// plain table will extract the prefix from a key. The prefix will be
// used for the calculating hash code, which will be used in hash-based
// index.
// Unlike Prefix_extractor is a raw pointer, please remember to delete it
// after use.
SliceTransform* prefix_extractor = new NewFixedPrefixTransform(8);
options.prefix_extractor = prefix_extractor;
options.table_factory.reset(NewPlainTableFactory(
    // plain table has optimization for fix-sized keys, which can be
    // specified via user_key_len.  Alternatively, you can pass
    // `kPlainTableVariableLength` if your keys have variable lengths.
    8,
    // For advanced users only. 
    // Bits per key for plain table's bloom filter, which helps rule out non-existent
    // keys faster. If you want to disable it, simply pass `0`.
    // Default: 10.
    10,
    // For advanced users only.
    // Hash table ratio. the desired utilization of the hash table used for prefix
    // hashing. hash_table_ratio = number of prefixes / #buckets in the hash table.
    0.75
));
rocksdb::DB::Open(options, "/tmp/testdb", &db);
...
delete prefix_extractor;

Contents

RocksDB Wiki
Overview
RocksDB FAQ
Terminology
Requirements
Contributors' Guide
Release Methodology
RocksDB Users and Use Cases
RocksDB Public Communication and Information Channels
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator (Experimental)
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
Options
- Setup Options and Basic Tuning
- Option String and Option Map
- RocksDB Options File
MemTable
Journal
- Write Ahead Log (WAL)
- MANIFEST
- Track WAL in MANIFEST
Cache
- Block Cache
- SecondaryCache (Experimental)
Write Buffer Manager
Compaction
- Leveled Compaction
- Universal compaction style
- FIFO compaction style
- Manual Compaction
- Subcompaction
- Choose Level Compaction Files
- Managing Disk Space Utilization
- Trivial Move Compaction
- Remote Compaction (Experimental)
SST File Formats
- Block-based Table Format
- PlainTable Format
- CuckooTable Format
- Index Block Format
- Bloom Filter
- Data Block Hash Index
IO
- Rate Limiter
- SST File Manager
- Direct I/O
Compression
- Dictionary Compression
Full File Checksum and Checksum Handoff
Background Error Handling
Huge Page TLB Support
Tiered Storage (Experimental)
Logging and Monitoring
- Logger
- Statistics
- Compaction Stats and DB Status
- Perf Context and IO Stats Context
- EventListener
Known Issues
Troubleshooting Guide
Tests
- Stress Test
- Fuzzing
- Benchmarking
Tools / Utilities
- Administration and Data Access Tool
- How to Backup RocksDB?
- Replication Helpers
- Checkpoints
- How to persist in-memory RocksDB database
- Third-party language bindings
- RocksDB Trace, Replay, Analyzer, and Workload Generation
- Block cache analysis and simulation tools
- IO Tracer and Parser
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
Extending RocksDB
- RocksDB Configurable Objects
- The Customizable Class
- Object Registry
RocksJava
- RocksJava Basics
- Logging in RocksJava
- JNI Debugging
- RocksJava API TODO
- RocksJava Performance on Flash Storage
- Tuning RocksDB from Java
Lua
- Lua CompactionFilter
Performance
- Performance Benchmarks
- In Memory Workload Performance
- Read-Modify-Write (Merge) Performance
- Delete A Range Of Keys
- Write Stalls
- Pipelined Write
- MultiGet Performance
- Tuning Guide
- Memory usage in RocksDB
- Speed-Up DB Open
- Implement Queue Service Using RocksDB
- Asynchronous IO
- Off-peak in RocksDB
Projects Being Developed
Misc
- Building on Windows
- Developing with an IDE
- Open Projects
- Talks
- Publication
- Features Not in LevelDB
- How to ask a performance-related question?
- Articles about Rocks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A Tutorial of RocksDB SST formats

What is SST (Static Sorted Table)

Choosing a table format

Block-based table

Plain table

Comparison of SSTs

Examples

Block-based table

Plain table

Clone this wiki locally