Releases: facebook/rocksdb
RocksDB 9.0.1
9.0.1 (2024-04-11)
Bug Fixes
- Fixed CMake Javadoc and source jar builds
- Fixed Java `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`
RocksDB 8.11.4
8.11.4 (2024-04-09)
Bug Fixes
- Fixed CMake Javadoc build
- Fixed Java `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`
RocksDB 9.0.0
9.0.0 (2024-02-16)
New Features
- Provide support for FSBuffer for point lookups. Also added support for scans and compactions that don't go through prefetching.
- Make `SstFileWriter` create SST files without persisting user defined timestamps when the `Option.persist_user_defined_timestamps` flag is set to false.
- Add support for user-defined timestamps in APIs `DeleteFilesInRanges` and `GetPropertiesOfTablesInRange`.
- Mark the wal_compression feature as production-ready. Currently only compatible with ZSTD compression.
Public API Changes
- Allow setting Stderr logger via C API
- Declare one Get and one MultiGet variant as pure virtual, and make all the other variants non-overridable. The methods required to be implemented by derived classes of DB allow returning timestamps. It is up to the implementation to check and return an error if timestamps are not supported. The non-batched MultiGet APIs are reimplemented in terms of batched MultiGet, so callers might see a performance improvement.
- Exposed the mode option of the Rate Limiter via the C API.
- Removed deprecated option `access_hint_on_compaction_start`
- Removed deprecated option `ColumnFamilyOptions::check_flush_compaction_key_order`
- Remove the default `WritableFile::GetFileSize` and `FSWritableFile::GetFileSize` implementations that return 0 and make them pure virtual, so that subclasses are enforced to explicitly provide an implementation.
- Removed deprecated option `ColumnFamilyOptions::level_compaction_dynamic_file_size`
- Removed tickers with typos "rocksdb.error.handler.bg.errro.count", "rocksdb.error.handler.bg.io.errro.count", "rocksdb.error.handler.bg.retryable.io.errro.count".
- Remove the force mode for the `EnableFileDeletions` API because it is unsafe with no known legitimate use.
- Removed deprecated option `ColumnFamilyOptions::ignore_max_compaction_bytes_for_input`
- `sst_dump --command=check` now compares the number of records in a table with `num_entries` in the table property, and reports corruption if there is a mismatch. The API `SstFileDumper::ReadSequential()` is updated to optionally do this verification. (#12322)
Behavior Changes
- format_version=6 is the new default setting in BlockBasedTableOptions, for more robust data integrity checking. DBs and SST files written with this setting cannot be read by RocksDB versions before 8.6.0.
- Compactions can be scheduled in parallel in an additional scenario: multiple files are marked for compaction within a single column family
- For leveled compaction, RocksDB will try to do intra-L0 compaction if the total L0 size is small compared to Lbase (#12214). Users with atomic_flush=true are more likely to see the impact of this change.
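For users who need SST files readable by releases older than 8.6.0, a minimal sketch of pinning the previous format version (assuming the standard block-based table factory; the path is illustrative):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/table.h>

// Sketch: keep writing format_version=5 SST files so they remain readable
// by RocksDB versions before 8.6.0. With 9.0.0 the default is 6.
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::BlockBasedTableOptions table_options;
  table_options.format_version = 5;  // override the new default of 6
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fv5_example_db", &db);
  delete db;
  return s.ok() ? 0 : 1;
}
```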
Bug Fixes
- Fixed a data race in `DBImpl::RenameTempFileToOptionsFile`.
- Fix some perf context statistics errors in write steps, including: missing write_memtable_time in unordered_write; missing write_memtable_time in PipelineWrite when the writer state is STATE_PARALLEL_MEMTABLE_WRITER; missing write_delay_time when calling DelayWrite in the WriteImplWALOnly function.
- Fixed a bug that can, under rare circumstances, cause MultiGet to return an incorrect result for a duplicate key in a MultiGet batch.
- Fix a bug where older data of an ingested key can be returned for read when universal compaction is used
RocksDB 8.11.3
8.11.3 (2024-02-27)
- Correct CMake Javadoc and source jar builds
8.11.2 (2024-02-16)
- Update zlib to 1.3.1 for Java builds
8.11.1 (2024-01-25)
Bug Fixes
- Fix a bug where older data of an ingested key can be returned for read when universal compaction is used
- Apply appropriate rate limiting and priorities in more places.
8.11.0 (2024-01-19)
New Features
- Add new statistics: `rocksdb.sst.write.micros` measures time of each write to SST file; `rocksdb.file.write.{flush|compaction|db.open}.micros` measure time of each write to SST table (currently only block-based table format) and blob file for flush, compaction and db open.
Public API Changes
- Added another enumerator `kVerify` to enum class `FileOperationType` in listener.h. Update your `switch` statements as needed.
- Add CompressionOptions to the CompressedSecondaryCacheOptions structure to allow users to specify library-specific options when creating the compressed secondary cache.
- Deprecated several options: `level_compaction_dynamic_file_size`, `ignore_max_compaction_bytes_for_input`, `check_flush_compaction_key_order`, `flush_verify_memtable_count`, `compaction_verify_record_count`, `fail_if_options_file_error`, and `enforce_single_del_contracts`
- Exposed the ttl option via the C API.
Behavior Changes
- `rocksdb.blobdb.blob.file.write.micros` expands to also measure time writing the header and footer. Therefore the COUNT may be higher and values may be smaller than before. For stacked BlobDB, it no longer measures the time of explicitly flushing blob file.
- Files will be compacted to the next level if the data age exceeds periodic_compaction_seconds except for the last level.
- Reduced the compaction debt ratio trigger for scheduling parallel compactions
- For leveled compaction with default compaction pri (kMinOverlappingRatio), files marked for compaction will be prioritized over files not marked when picking a file from a level for compaction.
Bug Fixes
- Fix a bug where auto_readahead_size combined with IndexType::kBinarySearchWithFirstKey causes a read to fail or the iterator to land on a wrong key
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fix bugs where `rocksdb.blobdb.blob.file.synced` includes blob files that failed to get synced and `rocksdb.blobdb.blob.file.bytes.written` includes blob bytes that failed to get written.
- Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with `BackupEngine`, `sst_dump`, or `ldb`.
- Fixed issues with the experimental `preclude_last_level_data_seconds` option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.
RocksDB 8.10.2
8.10.2 (2024-02-16)
- Update zlib to 1.3.1 for Java builds
8.10.1 (2024-01-16)
Bug Fixes
- Fix a bug where auto_readahead_size combined with IndexType::kBinarySearchWithFirstKey causes a read to fail or the iterator to land on a wrong key
RocksDB 8.10.0
8.10.0 (2023-12-15)
New Features
- Provide support for async_io to trim readahead_size by doing block cache lookup
- Added initial wide-column support in `WriteBatchWithIndex`. This includes the `PutEntity` API and support for wide columns in the existing read APIs (`GetFromBatch`, `GetFromBatchAndDB`, `MultiGetFromBatchAndDB`, and `BaseDeltaIterator`).
Public API Changes
- Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
Behavior Changes
- Make ReadOptions.auto_readahead_size default to true, which enables prefetching optimizations for forward scans when iterate_upper_bound and a block cache are also specified.
- Compactions can be scheduled in parallel in an additional scenario: high compaction debt relative to the data size
- HyperClockCache now has built-in protection against excessive CPU consumption under the extreme stress condition of no (or very few) evictable cache entries, which can slightly increase memory usage under such conditions. The new option `HyperClockCacheOptions::eviction_effort_cap` controls the space-time trade-off of the response. The default should be generally well-balanced, with no measurable effect on normal operation.
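A minimal sketch of a forward scan that benefits from the new auto_readahead_size default (assuming an already-open `db` with a block cache configured; the bound is illustrative):

```cpp
#include <memory>

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>

// Sketch: auto_readahead_size (default true since 8.10.0) trims readahead
// for forward scans only when iterate_upper_bound is set and a block cache
// is configured.
void ScanWithTrimmedReadahead(rocksdb::DB* db) {
  rocksdb::Slice upper_bound("user_key_zzz");  // illustrative bound
  rocksdb::ReadOptions read_options;
  read_options.iterate_upper_bound = &upper_bound;  // required for trimming
  read_options.auto_readahead_size = true;          // already the default

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // consume it->key() / it->value()
  }
}
```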
Bug Fixes
- Fix a corner case with auto_readahead_size where the Prev operation returns a NotSupported error when the scan direction is changed from forward to backward.
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
- Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
- Fix a WRITE_STALL counter that was reporting wrong value in few cases.
- Fixed a bug where a MultiGet lookup in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e. before the subsequent call to WaitAll, was ignored, resulting in a false negative and a memory leak.
Performance Improvements
- Java API extensions to improve consistency and completeness of APIs:
  - Extended `RocksDB.get([ColumnFamilyHandle columnFamilyHandle,] ReadOptions opt, ByteBuffer key, ByteBuffer value)` which now accepts indirect buffer parameters as well as direct buffer parameters
  - Extended `RocksDB.put([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOpts, final ByteBuffer key, final ByteBuffer value)` which now accepts indirect buffer parameters as well as direct buffer parameters
  - Added `RocksDB.merge([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOptions, ByteBuffer key, ByteBuffer value)` methods with the same parameter options as `put(...)` - direct and indirect buffers are supported
  - Added `RocksIterator.key(byte[] key [, int offset, int len])` methods which retrieve the iterator key into the supplied buffer
  - Added `RocksIterator.value(byte[] value [, int offset, int len])` methods which retrieve the iterator value into the supplied buffer
  - Deprecated `get(final ColumnFamilyHandle columnFamilyHandle, final ReadOptions readOptions, byte[])` in favour of `get(final ReadOptions readOptions, final ColumnFamilyHandle columnFamilyHandle, byte[])` which has consistent parameter ordering with other methods in the same class
  - Added `Transaction.get(ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value)` methods which retrieve the requested value into the supplied buffer
  - Added `Transaction.get(ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value)` methods which retrieve the requested value into the supplied buffer
  - Added `Transaction.getForUpdate(ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value, boolean exclusive [, boolean doValidate])` methods which retrieve the requested value into the supplied buffer
  - Added `Transaction.getForUpdate(ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value, boolean exclusive [, boolean doValidate])` methods which retrieve the requested value into the supplied buffer
  - Added `Transaction.getIterator()` method as a convenience which defaults the `ReadOptions` value supplied to existing `Transaction.iterator()` methods. This mirrors the existing `RocksDB.iterator()` method.
  - Added `Transaction.put([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked])` methods which supply the key and the value to be written in a `ByteBuffer` parameter
  - Added `Transaction.merge([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked])` methods which supply the key and the value to be written/merged in a `ByteBuffer` parameter
  - Added `Transaction.mergeUntracked([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value)` methods which supply the key and the value to be written/merged in a `ByteBuffer` parameter
RocksDB 8.9.1
8.9.1 (2023-12-08)
Bug Fixes
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
8.9.0 (2023-11-17)
New Features
- Add GetEntity() and PutEntity() API implementation for Attribute Group support. Through the use of Column Families, AttributeGroup enables users to logically group wide-column entities.
Public API Changes
- Added rocksdb_ratelimiter_create_auto_tuned API to create an auto-tuned GenericRateLimiter.
- Added clipColumnFamily() to the Java API to clip the entries in the CF according to the range [begin_key, end_key).
- Make the `EnableFileDeletions` API not default to force enabling. Users that rely on this default behavior and still want to continue using force enabling need to explicitly pass `true` to `EnableFileDeletions`.
- Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.
Behavior Changes
- During off-peak hours defined by `daily_offpeak_time_utc`, the compaction picker will select a larger number of files for periodic compaction. This selection will include files that are projected to expire by the next off-peak start time, ensuring that these files are not chosen for periodic compaction outside of off-peak hours.
- If an error occurs when writing to a trace file after `DB::StartTrace()`, the subsequent trace writes are skipped to avoid writing to a file that has previously seen an error. In this case, `DB::EndTrace()` will also return a non-OK status with info about the previous error in its status message.
- Deleting stale files upon recovery is delegated to SstFileManager if available so they can be rate limited.
- Make RocksDB only call `TablePropertiesCollector::Finish()` once.
- When `WAL_ttl_seconds > 0`, we now process archived WALs for deletion at least every `WAL_ttl_seconds / 2` seconds. Previously it could be less frequent in case of small `WAL_ttl_seconds` values when size-based expiration (`WAL_size_limit_MB > 0`) was simultaneously enabled.
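The new archived-WAL scan cadence can be illustrated with a small hypothetical helper (not a RocksDB API), assuming the `WAL_ttl_seconds / 2` rule stated above:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a RocksDB API): per the note above, archived WALs
// are processed for deletion at least every WAL_ttl_seconds / 2 seconds when
// WAL_ttl_seconds > 0; a TTL of 0 disables TTL-based WAL deletion.
uint64_t ArchivedWalScanIntervalSeconds(uint64_t wal_ttl_seconds) {
  if (wal_ttl_seconds == 0) {
    return 0;  // TTL-based archived WAL deletion is disabled
  }
  return wal_ttl_seconds / 2;
}
```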
Bug Fixes
- Fixed a crash or assertion failure bug in experimental new HyperClockCache variant, especially when running with a SecondaryCache.
- Fix a race between flush error recovery and db destruction that can lead to db crashing.
- Fixed some bugs in the index builder/reader path for user-defined timestamps in Memtable only feature.
RocksDB 8.8.1
8.8.1 (2023-11-17)
Bug fixes
- Make the cache memory reservation accounting in Tiered cache (primary and compressed secondary cache) more accurate to avoid over/under charging the secondary cache.
- Allow increasing the compressed_secondary_ratio in the Tiered cache after setting it to 0 to disable.
8.8.0 (2023-10-23)
New Features
- Introduce AttributeGroup by adding the first AttributeGroup support API, MultiGetEntity(). Through the use of Column Families, AttributeGroup enables users to logically group wide-column entities. More APIs to support AttributeGroup will come soon, including GetEntity, PutEntity, and others.
- Added new tickers `rocksdb.fifo.{max.size|ttl}.compactions` to count FIFO compactions that drop files for different reasons
- Add an experimental offpeak duration awareness by setting `DBOptions::daily_offpeak_time_utc` in "HH:mm-HH:mm" format. This information will be used for resource optimization in the future
- Users can now change the max bytes granted in a single refill period (i.e., burst) during runtime by `SetSingleBurstBytes()` for the RocksDB rate limiter
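The "HH:mm-HH:mm" layout expected by `daily_offpeak_time_utc` can be sketched with a hypothetical validator (not a RocksDB API; RocksDB's own parsing may be stricter):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not a RocksDB API): checks the "HH:mm-HH:mm" layout
// expected by DBOptions::daily_offpeak_time_utc, e.g. "04:00-08:00".
bool IsValidOffpeakSpec(const std::string& spec) {
  int start_hour = 0, start_min = 0, end_hour = 0, end_min = 0;
  char sep = '\0';
  if (std::sscanf(spec.c_str(), "%2d:%2d%c%2d:%2d", &start_hour, &start_min,
                  &sep, &end_hour, &end_min) != 5 ||
      sep != '-') {
    return false;
  }
  auto in_range = [](int h, int m) {
    return h >= 0 && h <= 23 && m >= 0 && m <= 59;
  };
  return in_range(start_hour, start_min) && in_range(end_hour, end_min);
}
```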
Public API Changes
- The default value of `DBOptions::fail_if_options_file_error` changed from `false` to `true`. Operations that set in-memory options (e.g., `DB::Open*()`, `DB::SetOptions()`, `DB::CreateColumnFamily*()`, and `DB::DropColumnFamily()`) but fail to persist the change will now return a non-OK `Status` by default.
- Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.
Behavior Changes
- For non-direct IO, eliminate the file system prefetching attempt for compaction read when `Options::compaction_readahead_size` is 0
- During a write stop, writes now block on in-progress recovery attempts
- Deleting stale files upon recovery is delegated to SstFileManager if available so they can be rate limited.
Bug Fixes
- Fix a bug in auto_readahead_size where first_internal_key of index blocks wasn't copied properly resulting in corruption error when first_internal_key was used for comparison.
- Fixed a bug where compaction read under non-direct IO still falls back to RocksDB internal prefetching after the file system's prefetching returns a non-OK status other than `Status::NotSupported()`
- Add bounds check in WBWIIteratorImpl and make BaseDeltaIterator, WriteUnpreparedTxn and WritePreparedTxn respect the upper bound and lower bound in ReadOption. See 11680.
- Fixed the handling of wide-column base values in the `max_successive_merges` logic.
- Fixed a rare race bug involving a concurrent combination of Create/DropColumnFamily and/or Set(DB)Options that could lead to inconsistency between (a) the DB's reported options state, (b) the DB options in effect, and (c) the latest persisted OPTIONS file.
- Fixed a possible underflow when computing the compressed secondary cache share of memory reservations while updating the compressed secondary to total block cache ratio.
Performance Improvements
- Improved the I/O efficiency of DB::Open when opening a new DB with `create_missing_column_families=true` and many column families.
RocksDB 8.7.3
8.7.3 (2023-10-30)
Behavior Changes
- Deleting stale files upon recovery is delegated to SstFileManager if available so they can be rate limited.
8.7.2 (2023-10-25)
Public API Changes
- Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.
Bug Fixes
- Fixed a possible underflow when computing the compressed secondary cache share of memory reservations while updating the compressed secondary to total block cache ratio.
- Fix an assertion failure when UpdateTieredCache() is called in an idempotent manner.
8.7.1 (2023-10-20)
Bug Fixes
- Fix a bug in auto_readahead_size where first_internal_key of index blocks wasn't copied properly resulting in corruption error when first_internal_key was used for comparison.
- Add bounds check in WBWIIteratorImpl and make BaseDeltaIterator, WriteUnpreparedTxn and WritePreparedTxn respect the upper bound and lower bound in ReadOption. See 11680.
8.7.0 (2023-09-22)
New Features
- Added an experimental new "automatic" variant of HyperClockCache that does not require a prior estimate of the average size of cache entries. This variant is activated when HyperClockCacheOptions::estimated_entry_charge = 0 and has essentially the same concurrency benefits as the existing HyperClockCache.
- Add a new statistic `COMPACTION_CPU_TOTAL_TIME` that records cumulative compaction CPU time. This ticker is updated regularly while a compaction is running.
- Add `GetEntity()` API for ReadOnly DB and Secondary DB.
- Add a new iterator API `Iterator::Refresh(const Snapshot *)` that allows the iterator to be refreshed while using the input snapshot to read.
- Added a new read option `merge_operand_count_threshold`. When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode `kMergeOperandThresholdExceeded`. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction.
- For `NewRibbonFilterPolicy()`, made the `bloom_before_level` option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See comments on `NewRibbonFilterPolicy()`
- RocksDB now allows the block cache to be stacked on top of a compressed secondary cache and a non-volatile secondary cache, thus creating a three-tier cache. To set it up, use the `NewTieredCache()` API in rocksdb/cache.h.
- Added a new wide-column aware full merge API called `FullMergeV3` to `MergeOperator`. `FullMergeV3` supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to `FullMergeV2`. If the base value is a wide-column entity, the default implementation invokes `FullMergeV2` to perform the merge on the default column, and leaves any other columns unchanged.
- Add wide column support to ldb commands (scan, dump, idump, dump_wal) and the sst_dump tool's scan command
Public API Changes
- Expose more information about input files used in table creation (if any) in `CompactionFilter::Context`. See `CompactionFilter::Context::input_start_level` and `CompactionFilter::Context::input_table_properties` for more.
- `Options::compaction_readahead_size`'s default value is changed from 0 to 2MB.
- When using LZ4 compression, the `acceleration` parameter is configurable by setting the negated value in `CompressionOptions::level`. For example, `CompressionOptions::level=-10` will set `acceleration=10`
- The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache`, is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
- The `NewTieredVolatileCache()` API in rocksdb/cache.h has been renamed to `NewTieredCache()`.
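The negated-level convention for LZ4 can be sketched with a hypothetical helper (not a RocksDB API), assuming the mapping stated in the note above:

```cpp
#include <cassert>

// Hypothetical helper (not a RocksDB API): per the note above, a negative
// CompressionOptions::level selects LZ4 acceleration (level=-10 means
// acceleration=10); non-negative levels keep LZ4's default acceleration of 1.
int Lz4AccelerationFromLevel(int level) {
  return level < 0 ? -level : 1;
}
```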
Behavior Changes
- Compaction read performance will regress when `Options::compaction_readahead_size` is explicitly set to 0
is explicitly set to 0 - Universal size amp compaction will conditionally exclude some of the newest L0 files when selecting input with a small negative impact to size amp. This is to prevent a large number of L0 files from being locked by a size amp compaction, potentially leading to write stop with a few more flushes.
- Change ldb scan command delimiter from ':' to '==>'.
- For non-direct IO, eliminate the file system prefetching attempt for compaction read when `Options::compaction_readahead_size` is 0
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
- Fix a bug with atomic_flush=true that can cause the DB to get stuck after a flush fails (#11872).
- Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in DB entering read-only state with error message like `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
- Fix an assertion fault during seek with async_io when readahead trimming is enabled.
- When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
- Updating the tiered cache (cache allocated using NewTieredCache()) by calling SetCapacity() on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But SetCapacity() would just set the primary cache capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, SetCapacity() will distribute the new capacity across the two caches by the same ratio.
- Fixed a bug in `MultiGet` for cleaning up SuperVersion acquired with locking db mutex.
- Fix a bug where row cache can falsely return kNotFound even though the row cache entry is hit.
- Fixed a race condition in `GenericRateLimiter` that could cause it to stop granting requests
- Fix a bug (Issue #10257) where DB can hang after write stall since no compaction is scheduled (#11764).
- Add a fix for async_io where during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
- Fix an issue in the sst_dump tool to handle bounds specified for data with user-defined timestamps.
- When auto_readahead_size is enabled, update readahead upper bound during readahead trimming when reseek changes iterate_upper_bound dynamically.
- Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated
- Fixed a bug where compaction read under non-direct IO still falls back to RocksDB internal prefetching after the file system's prefetching returns a non-OK status other than `Status::NotSupported()`
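The capacity split used by the tiered-cache SetCapacity() fix above can be sketched with a hypothetical helper (not a RocksDB API), assuming a total budget and compressed-cache ratio as described for `NewTieredCache()`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a RocksDB API): splits a total cache budget
// between the primary cache and the compressed secondary cache by a fixed
// ratio, mirroring how SetCapacity() redistributes capacity after the fix.
struct TieredSplit {
  size_t primary_bytes;
  size_t compressed_bytes;
};

TieredSplit SplitTieredCapacity(size_t total_bytes, double compressed_ratio) {
  TieredSplit split;
  split.compressed_bytes =
      static_cast<size_t>(static_cast<double>(total_bytes) * compressed_ratio);
  split.primary_bytes = total_bytes - split.compressed_bytes;
  return split;
}
```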
Performance Improvements
- Added additional improvements in tuning readahead_size during Scans when auto_readahead_size is enabled. However it's not recommended for backward scans and might impact the performance. More details in options.h.
- During async_io, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, it tries to lookup the block cache for the data block first before looking in the prefetch buffer. It's optimized by doing the block cache lookup only in the first phase that would save some CPU.
RocksDB 8.6.7
8.6.7 (2023-09-26)
Bug Fixes
- Fixed a bug where compaction read under non-direct IO still falls back to RocksDB internal prefetching after the file system's prefetching returns a non-OK status other than `Status::NotSupported()`
Behavior Changes
- For non-direct IO, eliminate the file system prefetching attempt for compaction read when `Options::compaction_readahead_size` is 0
8.6.6 (2023-09-25)
Bug Fixes
- Fix a bug with atomic_flush=true that can cause the DB to get stuck after a flush fails (#11872).
- Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in DB entering read-only state with error message like `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
- When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
8.6.5 (2023-09-15)
Bug Fixes
- Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated.
8.6.4 (2023-09-13)
Public API changes
- Add a column family option `default_temperature` that is used for file reading accounting purposes, such as io statistics, for files that don't have an explicitly set temperature.
8.6.3 (2023-09-12)
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
8.6.2 (2023-09-11)
Bug Fixes
- Add a fix for async_io where during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
8.6.1 (2023-08-30)
Public API Changes
- `Options::compaction_readahead_size`'s default value is changed from 0 to 2MB.
Behavior Changes
- Compaction read performance will regress when `Options::compaction_readahead_size` is explicitly set to 0
8.6.0 (2023-08-18)
New Features
- Added enhanced data integrity checking on SST files with new format_version=6. Performance impact is very small or negligible. Previously if SST data was misplaced or re-arranged by the storage layer, it could pass block checksum with higher than 1 in 4 billion probability. With format_version=6, block checksums depend on what file they are in and location within the file. This way, misplaced SST data is no more likely to pass checksum verification than randomly corrupted data. Also in format_version=6, SST footers are checksum-protected.
- Add a new feature to trim readahead_size during scans up to upper_bound when iterate_upper_bound is specified. It's enabled through ReadOptions.auto_readahead_size. Users must also specify ReadOptions.iterate_upper_bound.
- RocksDB will compare the number of input keys to the number of keys processed after each compaction. Compaction will fail and report a Corruption status if the verification fails. Option `compaction_verify_record_count` is introduced for this purpose and is enabled by default.
- Add a CF option `bottommost_file_compaction_delay` to allow specifying the delay of bottommost level single-file compactions.
- Add support to allow enabling / disabling the user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
- Implement a new admission policy for the compressed secondary cache that admits blocks evicted from the primary cache with the hit bit set. This policy can be specified in TieredVolatileCacheOptions by setting the newly added adm_policy option.
- Add a column family option `memtable_max_range_deletions` that limits the number of range deletions in a memtable. RocksDB will try to do an automatic flush after the limit is reached. (#11358)
- Add PutEntity API in sst_file_writer
- Add a `timeout` option in microseconds to `WaitForCompactOptions` to allow timely termination of prolonged waiting in scenarios like recurring recoverable errors, such as out-of-space situations and continuous write streams that sustain ongoing flush and compactions
- New statistics `rocksdb.file.read.{get|multiget|db.iterator|verify.checksum|verify.file.checksums}.micros` measure read time of block-based SST tables or blob files during db open, `Get()`, `MultiGet()`, using db iterator, `VerifyFileChecksums()` and `VerifyChecksum()`. They require a stats level greater than `StatsLevel::kExceptDetailedTimers`.
- Add a close_db option to `WaitForCompactOptions` to call Close() after waiting is done.
- Add a new compression option `CompressionOptions::checksum` for enabling ZSTD's checksum feature to detect corruption during decompression.
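A minimal sketch of enabling the new ZSTD checksum option (field names follow these release notes; a ZSTD-enabled RocksDB build is an assumption):

```cpp
#include <rocksdb/options.h>

// Sketch: opt in to ZSTD's checksum so corruption is detected during
// decompression (new in 8.6.0).
void EnableZstdWithChecksum(rocksdb::Options& options) {
  options.compression = rocksdb::kZSTD;
  options.compression_opts.checksum = true;
}
```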
Public API Changes
- Mark `Options::access_hint_on_compaction_start` related APIs as deprecated. See #11631 for alternative behavior.
Behavior Changes
- Statistics `rocksdb.sst.read.micros` now includes time spent on multi read and async read into the file
- For Universal Compaction users, periodic compaction (option `periodic_compaction_seconds`) will be set to 30 days by default if block based table is used.
Bug Fixes
- Fix a bug in FileTTLBooster that can cause users with a large number of levels (more than 65) to see errors like "runtime error: shift exponent .. is too large.." (#11673).