Commit 218421c

Remove Annoy indexes
Annoy indexes fell out of favor in the community, at least when it comes to vector databases. Such indexes work okay-ish in low dimensions, but they suffer badly from the curse of dimensionality, which makes them unsuitable for a high number of dimensions.

Now that Annoy is gone, issue (*) also disappears and we can drop 'no-ubsan', 'no-cpu-aarch64', and 'no-asan' from the tests.

(*) spotify/annoy#456
1 parent 7c41939 commit 218421c

29 files changed: +32 -1010 lines changed

.gitmodules

-3
@@ -230,9 +230,6 @@
 [submodule "contrib/minizip-ng"]
 	path = contrib/minizip-ng
 	url = https://github.com/zlib-ng/minizip-ng
-[submodule "contrib/annoy"]
-	path = contrib/annoy
-	url = https://github.com/ClickHouse/annoy
 [submodule "contrib/qpl"]
 	path = contrib/qpl
 	url = https://github.com/intel/qpl

contrib/CMakeLists.txt

-1
@@ -205,7 +205,6 @@ add_contrib (morton-nd-cmake morton-nd)
 if (ARCH_S390X)
     add_contrib(crc32-s390x-cmake crc32-s390x)
 endif()
-add_contrib (annoy-cmake annoy)
 
 option(ENABLE_USEARCH "Enable USearch" ${ENABLE_LIBRARIES})
 if (ENABLE_USEARCH)

contrib/annoy

-1
This file was deleted.

contrib/annoy-cmake/CMakeLists.txt

-23
This file was deleted.

docs/en/engines/table-engines/mergetree-family/annindexes.md

+14 -73
@@ -126,81 +126,8 @@ was specified for ANN indexes, the default value is 100 million.
 
 # Available ANN Indexes {#available_ann_indexes}
 
-- [Annoy](/docs/en/engines/table-engines/mergetree-family/annindexes.md#annoy-annoy)
-
 - [USearch](/docs/en/engines/table-engines/mergetree-family/annindexes.md#usearch-usearch)
 
-## Annoy {#annoy}
-
-Annoy indexes are currently experimental, to use them you first need to `SET allow_experimental_annoy_index = 1`. They are also currently
-disabled on ARM due to memory safety problems with the algorithm.
-
-This type of ANN index is based on the [Annoy library](https://github.com/spotify/annoy) which recursively divides the space into random
-linear surfaces (lines in 2D, planes in 3D etc.).
-
-<div class='vimeo-container'>
-  <iframe src="//www.youtube.com/embed/QkCCyLW0ehU"
-    width="640"
-    height="360"
-    frameborder="0"
-    allow="autoplay;
-    fullscreen;
-    picture-in-picture"
-    allowfullscreen>
-  </iframe>
-</div>
-
-Syntax to create an Annoy index over an [Array(Float32)](../../../sql-reference/data-types/array.md) column:
-
-```sql
-CREATE TABLE table_with_annoy_index
-(
-  id Int64,
-  vectors Array(Float32),
-  INDEX [ann_index_name] vectors TYPE annoy([Distance[, NumTrees]]) [GRANULARITY N]
-)
-ENGINE = MergeTree
-ORDER BY id;
-```
-
-Annoy currently supports two distance functions:
-- `L2Distance`, also called Euclidean distance, is the length of a line segment between two points in Euclidean space
-  ([Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)).
-- `cosineDistance`, also called cosine similarity, is the cosine of the angle between two (non-zero) vectors
-  ([Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)).
-
-For normalized data, `L2Distance` is usually a better choice, otherwise `cosineDistance` is recommended to compensate for scale. If no
-distance function was specified during index creation, `L2Distance` is used as default.
-
-Parameter `NumTrees` is the number of trees which the algorithm creates (default if not specified: 100). Higher values of `NumTree` mean
-more accurate search results but slower index creation / query times (approximately linearly) as well as larger index sizes.
-
-:::note
-All arrays must have same length. To avoid errors, you can use a
-[CONSTRAINT](/docs/en/sql-reference/statements/create/table.md#constraints), for example, `CONSTRAINT constraint_name_1 CHECK
-length(vectors) = 256`. Also, empty `Arrays` and unspecified `Array` values in INSERT statements (i.e. default values) are not supported.
-:::
-
-The creation of Annoy indexes (whenever a new part is build, e.g. at the end of a merge) is a relatively slow process. You can increase
-setting `max_threads_for_annoy_index_creation` (default: 4) which controls how many threads are used to create an Annoy index. Please be
-careful with this setting, it is possible that multiple indexes are created in parallel in which case there can be overparallelization.
-
-Setting `annoy_index_search_k_nodes` (default: `NumTrees * LIMIT`) determines how many tree nodes are inspected during SELECTs. Larger
-values mean more accurate results at the cost of longer query runtime:
-
-```sql
-SELECT *
-FROM table_name
-ORDER BY L2Distance(vectors, Point)
-LIMIT N
-SETTINGS annoy_index_search_k_nodes=100;
-```
-
-:::note
-The Annoy index currently does not work with per-table, non-default `index_granularity` settings (see
-[here](https://github.com/ClickHouse/ClickHouse/pull/51325#issuecomment-1605920475)). If necessary, the value must be changed in config.xml.
-:::
-
 ## USearch {#usearch}
 
 This type of ANN index is based on the [USearch library](https://github.com/unum-cloud/usearch), which implements the [HNSW
@@ -211,6 +138,8 @@ that are expensive to load and compare. The library also has several hardware-sp
 distance computations on modern Arm (NEON and SVE) and x86 (AVX2 and AVX-512) CPUs and OS-specific optimizations to allow efficient
 navigation around immutable persistent files, without loading them into RAM.
 
+USearch indexes are currently experimental, to use them you first need to `SET allow_experimental_usearch_index = 1`.
+
 <div class='vimeo-container'>
   <iframe src="//www.youtube.com/embed/UMrhB3icP9w"
     width="640"
@@ -247,3 +176,15 @@ was specified during index creation, `f16` is used as default.
 
 For normalized data, `L2Distance` is usually a better choice, otherwise `cosineDistance` is recommended to compensate for scale. If no
 distance function was specified during index creation, `L2Distance` is used as default.
+
+:::note
+All arrays must have same length. To avoid errors, you can use a
+[CONSTRAINT](/docs/en/sql-reference/statements/create/table.md#constraints), for example, `CONSTRAINT constraint_name_1 CHECK
+length(vectors) = 256`. Also, empty `Arrays` and unspecified `Array` values in INSERT statements (i.e. default values) are not supported.
+:::
+
+:::note
+The USearch index currently does not work with per-table, non-default `index_granularity` settings (see
+[here](https://github.com/ClickHouse/ClickHouse/pull/51325#issuecomment-1605920475)). If necessary, the value must be changed in config.xml.
+:::
+

src/CMakeLists.txt

-4
@@ -601,10 +601,6 @@ endif()
 
 dbms_target_link_libraries(PUBLIC ch_contrib::consistent_hashing)
 
-if (TARGET ch_contrib::annoy)
-    dbms_target_link_libraries(PUBLIC ch_contrib::annoy)
-endif()
-
 if (TARGET ch_contrib::usearch)
     dbms_target_link_libraries(PUBLIC ch_contrib::usearch)
 endif()

src/Common/config.h.in

-1
@@ -58,7 +58,6 @@
 #cmakedefine01 USE_FILELOG
 #cmakedefine01 USE_ODBC
 #cmakedefine01 USE_BLAKE3
-#cmakedefine01 USE_ANNOY
 #cmakedefine01 USE_USEARCH
 #cmakedefine01 USE_SKIM
 #cmakedefine01 USE_PRQL

src/Core/Settings.h

+3 -3
@@ -909,12 +909,9 @@ class IColumn;
     M(Bool, allow_experimental_time_series_table, false, "Allows experimental TimeSeries table engine", 0) \
     M(Bool, allow_experimental_variant_type, false, "Allow Variant data type", 0) \
     M(Bool, allow_experimental_dynamic_type, false, "Allow Dynamic data type", 0) \
-    M(Bool, allow_experimental_annoy_index, false, "Allows to use Annoy index. Disabled by default because this feature is experimental", 0) \
     M(Bool, allow_experimental_usearch_index, false, "Allows to use USearch index. Disabled by default because this feature is experimental", 0) \
     M(Bool, allow_experimental_codecs, false, "If it is set to true, allow to specify experimental compression codecs (but we don't have those yet and this option does nothing).", 0) \
     M(UInt64, max_limit_for_ann_queries, 1'000'000, "SELECT queries with LIMIT bigger than this setting cannot use ANN indexes. Helps to prevent memory overflows in ANN search indexes.", 0) \
-    M(UInt64, max_threads_for_annoy_index_creation, 4, "Number of threads used to build Annoy indexes (0 means all cores, not recommended)", 0) \
-    M(Int64, annoy_index_search_k_nodes, -1, "SELECT queries search up to this many nodes in Annoy indexes.", 0) \
     M(Bool, throw_on_unsupported_query_inside_transaction, true, "Throw exception if unsupported query is used inside transaction", 0) \
     M(TransactionsWaitCSNMode, wait_changes_become_visible_after_commit_mode, TransactionsWaitCSNMode::WAIT_UNKNOWN, "Wait for committed changes to become actually visible in the latest snapshot", 0) \
     M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
@@ -1036,6 +1033,9 @@ class IColumn;
     MAKE_OBSOLETE(M, UInt64, parallel_replicas_min_number_of_granules_to_enable, 0) \
     MAKE_OBSOLETE(M, Bool, query_plan_optimize_projection, true) \
     MAKE_OBSOLETE(M, Bool, query_cache_store_results_of_queries_with_nondeterministic_functions, false) \
+    MAKE_OBSOLETE(M, Bool, allow_experimental_annoy_index, false) \
+    MAKE_OBSOLETE(M, UInt64, max_threads_for_annoy_index_creation, 4) \
+    MAKE_OBSOLETE(M, Int64, annoy_index_search_k_nodes, -1) \
     MAKE_OBSOLETE(M, Bool, optimize_move_functions_out_of_any, false) \
     MAKE_OBSOLETE(M, Bool, allow_experimental_undrop_table_query, true) \
     MAKE_OBSOLETE(M, Bool, allow_experimental_s3queue, true) \
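The hunk above moves the three Annoy settings from the live list into the `MAKE_OBSOLETE` list rather than deleting them. The apparent intent is that old configurations and queries mentioning these names keep being accepted while the values no longer do anything. A minimal sketch of that idea follows. It is not ClickHouse's macro machinery: `SettingsSketch`, `declare`, `declareObsolete`, `set`, and `get` are hypothetical names invented for illustration.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a settings registry; not ClickHouse's real Settings class.
class SettingsSketch
{
public:
    // Register a live setting together with its default value.
    void declare(const std::string & name, long long default_value) { values[name] = default_value; }

    // Register a name that is still recognized but no longer has any effect.
    void declareObsolete(const std::string & name) { obsolete.insert(name); }

    // Returns true if the value was stored, false if it was silently ignored as obsolete.
    // Names that were never declared are rejected.
    bool set(const std::string & name, long long value)
    {
        if (obsolete.count(name) != 0)
            return false;
        auto it = values.find(name);
        if (it == values.end())
            throw std::invalid_argument("Unknown setting: " + name);
        it->second = value;
        return true;
    }

    long long get(const std::string & name) const { return values.at(name); }

private:
    std::map<std::string, long long> values;
    std::set<std::string> obsolete;
};
```

In this sketch, setting `allow_experimental_annoy_index` after `declareObsolete` is a no-op instead of an error, whereas a name that was removed outright would throw.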

src/Databases/DatabaseReplicated.cpp

-1
@@ -1153,7 +1153,6 @@ void DatabaseReplicated::recoverLostReplica(const ZooKeeperPtr & current_zookeep
         query_context->setSetting("allow_experimental_object_type", 1);
         query_context->setSetting("allow_experimental_variant_type", 1);
         query_context->setSetting("allow_experimental_dynamic_type", 1);
-        query_context->setSetting("allow_experimental_annoy_index", 1);
         query_context->setSetting("allow_experimental_usearch_index", 1);
         query_context->setSetting("allow_experimental_bigint_types", 1);
         query_context->setSetting("allow_experimental_window_functions", 1);

src/Interpreters/InterpreterCreateQuery.cpp

-2
@@ -787,8 +787,6 @@ InterpreterCreateQuery::TableProperties InterpreterCreateQuery::getTableProperti
             if (index_desc.type == INVERTED_INDEX_NAME && !settings.allow_experimental_inverted_index)
                 throw Exception(ErrorCodes::ILLEGAL_INDEX, "Please use index type 'full_text' instead of 'inverted'");
             /// ----
-            if (index_desc.type == "annoy" && !settings.allow_experimental_annoy_index)
-                throw Exception(ErrorCodes::INCORRECT_QUERY, "Annoy index is disabled. Turn on allow_experimental_annoy_index");
             if (index_desc.type == "usearch" && !settings.allow_experimental_usearch_index)
                 throw Exception(ErrorCodes::INCORRECT_QUERY, "USearch index is disabled. Turn on allow_experimental_usearch_index");
 

src/Parsers/ASTIndexDeclaration.h

-1
@@ -13,7 +13,6 @@ class ASTIndexDeclaration : public IAST
 {
 public:
     static const auto DEFAULT_INDEX_GRANULARITY = 1uz;
-    static const auto DEFAULT_ANNOY_INDEX_GRANULARITY = 100'000'000uz;
     static const auto DEFAULT_USEARCH_INDEX_GRANULARITY = 100'000'000uz;
 
     ASTIndexDeclaration(ASTPtr expression, ASTPtr type, const String & name_);

src/Parsers/ParserCreateIndexQuery.cpp

+1 -3
@@ -89,9 +89,7 @@ bool ParserCreateIndexDeclaration::parseImpl(Pos & pos, ASTPtr & node, Expected
     else
     {
         auto index_type = index->getType();
-        if (index_type && index_type->name == "annoy")
-            index->granularity = ASTIndexDeclaration::DEFAULT_ANNOY_INDEX_GRANULARITY;
-        else if (index_type && index_type->name == "usearch")
+        if (index_type && index_type->name == "usearch")
             index->granularity = ASTIndexDeclaration::DEFAULT_USEARCH_INDEX_GRANULARITY;
         else
             index->granularity = ASTIndexDeclaration::DEFAULT_INDEX_GRANULARITY;

src/Parsers/ParserCreateQuery.cpp

+1 -3
@@ -214,9 +214,7 @@ bool ParserIndexDeclaration::parseImpl(Pos & pos, ASTPtr & node, Expected & expe
     else
     {
         auto index_type = index->getType();
-        if (index_type->name == "annoy")
-            index->granularity = ASTIndexDeclaration::DEFAULT_ANNOY_INDEX_GRANULARITY;
-        else if (index_type->name == "usearch")
+        if (index_type->name == "usearch")
             index->granularity = ASTIndexDeclaration::DEFAULT_USEARCH_INDEX_GRANULARITY;
         else
             index->granularity = ASTIndexDeclaration::DEFAULT_INDEX_GRANULARITY;
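Both parser hunks leave the same decision in place: when no granularity is given, `usearch` indexes get the large ANN default and every other index type gets 1. The sketch below restates that rule as a standalone function. It assumes the `if` branch not shown in the hunks handles an explicit `GRANULARITY N`. `resolveGranularity` and the free-standing constants are hypothetical names that mirror the values in `ASTIndexDeclaration.h`; they are not part of the ClickHouse source.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Values copied from ASTIndexDeclaration.h; the free-standing constants are illustrative.
constexpr std::size_t DEFAULT_INDEX_GRANULARITY = 1;
constexpr std::size_t DEFAULT_USEARCH_INDEX_GRANULARITY = 100'000'000;

// Hypothetical helper: an explicit granularity wins (assumed behavior of the elided branch),
// otherwise the default depends only on whether the index type is "usearch".
constexpr std::size_t resolveGranularity(std::string_view index_type, std::optional<std::size_t> explicit_granularity)
{
    if (explicit_granularity)
        return *explicit_granularity;
    if (index_type == "usearch")
        return DEFAULT_USEARCH_INDEX_GRANULARITY;
    return DEFAULT_INDEX_GRANULARITY;
}
```

Under this sketch, the string `"annoy"` is no longer special at parse time and simply takes the generic default; whatever later stage validates the index type is outside what these hunks show.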

src/Processors/QueryPlan/ReadFromMergeTree.cpp

-5
@@ -24,7 +24,6 @@
 #include <Processors/Transforms/SelectByIndicesTransform.h>
 #include <QueryPipeline/QueryPipelineBuilder.h>
 #include <Storages/MergeTree/MergeTreeDataSelectExecutor.h>
-#include <Storages/MergeTree/MergeTreeIndexAnnoy.h>
 #include <Storages/MergeTree/MergeTreeIndexUSearch.h>
 #include <Storages/MergeTree/MergeTreeReadPool.h>
 #include <Storages/MergeTree/MergeTreePrefetchedReadPool.h>
@@ -1478,10 +1477,6 @@ static void buildIndexes(
             MergeTreeIndexConditionPtr condition;
             if (index_helper->isVectorSearch())
             {
-#if USE_ANNOY
-                if (const auto * annoy = typeid_cast<const MergeTreeIndexAnnoy *>(index_helper.get()))
-                    condition = annoy->createIndexCondition(query_info, context);
-#endif
 #if USE_USEARCH
                 if (const auto * usearch = typeid_cast<const MergeTreeIndexUSearch *>(index_helper.get()))
                     condition = usearch->createIndexCondition(query_info, context);
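After this hunk, `buildIndexes` has a single compile-time-guarded branch for vector-search indexes. The pattern is a downcast to the concrete index type that only exists when the matching `USE_*` flag is set. A minimal sketch of that pattern is shown below. All types here (`IndexSketch`, `USearchIndexSketch`, `MinMaxIndexSketch`) and the function `buildConditionSketch` are hypothetical, and standard `dynamic_cast` stands in for ClickHouse's `typeid_cast`, which is a different helper.

```cpp
#include <memory>
#include <string>

// Assumption for the sketch: the build enables USearch unless told otherwise.
#ifndef USE_USEARCH
#define USE_USEARCH 1
#endif

// Hypothetical base class for skip indexes.
struct IndexSketch
{
    virtual ~IndexSketch() = default;
    virtual bool isVectorSearch() const { return false; }
};

// Hypothetical vector-search index; the returned string stands in for a condition object.
struct USearchIndexSketch : IndexSketch
{
    bool isVectorSearch() const override { return true; }
    std::string createIndexCondition() const { return "usearch-condition"; }
};

// Hypothetical non-vector index used to show the fall-through path.
struct MinMaxIndexSketch : IndexSketch
{
};

// Returns the condition for a vector-search index, or an empty string when the index
// is not a vector-search index or its backend was compiled out.
std::string buildConditionSketch(const std::shared_ptr<IndexSketch> & index_helper)
{
    std::string condition;
    if (index_helper->isVectorSearch())
    {
#if USE_USEARCH
        if (const auto * usearch = dynamic_cast<const USearchIndexSketch *>(index_helper.get()))
            condition = usearch->createIndexCondition();
#endif
    }
    return condition;
}
```

Removing a backend under this design means deleting one guarded branch and its include, which matches the shape of the change above.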

src/Storages/MergeTree/MergeTreeIOSettings.cpp

-1
@@ -27,7 +27,6 @@ MergeTreeWriterSettings::MergeTreeWriterSettings(
     , rewrite_primary_key(rewrite_primary_key_)
     , blocks_are_granules_size(blocks_are_granules_size_)
     , query_write_settings(query_write_settings_)
-    , max_threads_for_annoy_index_creation(global_settings.max_threads_for_annoy_index_creation)
     , low_cardinality_max_dictionary_size(global_settings.low_cardinality_max_dictionary_size)
     , low_cardinality_use_single_dictionary_for_part(global_settings.low_cardinality_use_single_dictionary_for_part != 0)
     , use_compact_variant_discriminators_serialization(storage_settings->use_compact_variant_discriminators_serialization)

src/Storages/MergeTree/MergeTreeIOSettings.h

-2
@@ -77,8 +77,6 @@ struct MergeTreeWriterSettings
     bool blocks_are_granules_size;
     WriteSettings query_write_settings;
 
-    size_t max_threads_for_annoy_index_creation;
-
     size_t low_cardinality_max_dictionary_size;
     bool low_cardinality_use_single_dictionary_for_part;
     bool use_compact_variant_discriminators_serialization;
