Hi community,
I'm looking for ideas on how to improve the performance of vector searches on Elasticsearch v8.15.0.
The index I'm running searches on contains 150M documents (they are only loaded once, and never refreshed afterwards), with just 3 fields: id, name, and embeddings (a 128-dimension vector). The Elasticsearch deployment (ECK) is running on a Kubernetes cluster (GKE) and has 40 data nodes with 30 CPUs and 220 GB of memory each, plus 2 masters and 2 coordinators. The index is configured with 40 primary shards (1 per data node, to speed up initial indexing) and 20 replicas each. The index stats look as follows:
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   myindex _F_ZALb-TgeO9CzXD101Hw  40  20  149410256            0     14.4tb        702.2gb      702.2gb
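As a sanity check on these stats, here is a rough back-of-the-envelope sketch in Python. It assumes the _cat size units are binary (1 tb = 1024 gb), that shard copies are spread evenly across the 40 data nodes, and that the vectors are stored as 4-byte floats; none of this was measured directly.

```python
# Figures taken from the index stats above.
docs = 149_410_256
primary_store_gb = 702.2
primaries = 40
replicas = 20
data_nodes = 40
dims = 128

# Every primary has 20 replicas, so 21 copies of the data exist in total.
copies = 1 + replicas
total_store_tb = primary_store_gb * copies / 1024  # assumption: 1 tb = 1024 gb

# Assumption: shard copies are balanced evenly across the data nodes.
shard_copies_per_node = primaries * copies // data_nodes
store_per_node_gb = primary_store_gb * copies / data_nodes

# Assumption: float32 vectors (4 bytes per dimension), raw data only,
# not counting the HNSW graph or other index structures.
raw_vectors_gib = docs * dims * 4 / 1024**3

print(f"total store: {total_store_tb:.1f} tb")
print(f"shard copies per node: {shard_copies_per_node}")
print(f"store per node: {store_per_node_gb:.1f} gb")
print(f"raw vectors per copy: {raw_vectors_gib:.1f} GiB")
```

Under these assumptions the total matches the reported 14.4tb, and each data node holds 21 shard copies, which is more on-disk data per node than the 220 GB of memory available to it.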
The mapping of the embeddings field looks like this:
{
  "type": "dense_vector",
  "dims": 128,
  "index": true,
  "similarity": "cosine",
  "index_options": {
    "type": "hnsw",
    "ef_construction": 150,
    "m": 24
  }
}
Now, I have another dataset, also 150M rows, that I'm reading from BigQuery using Apache Spark (5 nodes with 8 cores each). I iterate over each partition and send multi search requests with a batch of 50 queries per msearch (8 tasks per node x 5 nodes x 50 queries per msearch = 2,000 concurrent searches).
I'm using the following query:
{
  "query": {
    "bool": {
      "should": [{
        "knn": {
          "field": "embeddings",
          "query_vector": [1.0, 0.54, 0.01, 1.5, ...],
          "k": 10,
          "num_candidates": 100
        }
      }]
    }
  },
  "sort": {"_score": "desc"},
  "fields": ["id"],
  "size": 10,
  "_source": false
}
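For context, this is roughly how a batch of these queries ends up as one msearch request body. It is a minimal sketch rather than my exact Spark code: the function name build_msearch_body and its parameters are made up for illustration, and it only builds the newline-delimited JSON (one header line plus one search line per query) without sending anything.

```python
import json


def build_msearch_body(query_vectors, index="myindex", k=10, num_candidates=100):
    """Build an NDJSON msearch body: a header line and a search line per vector.

    Hypothetical helper for illustration; it mirrors the query shown above.
    """
    lines = []
    for vector in query_vectors:
        header = {"index": index}
        search = {
            "query": {
                "bool": {
                    "should": [{
                        "knn": {
                            "field": "embeddings",
                            "query_vector": list(vector),
                            "k": k,
                            "num_candidates": num_candidates,
                        }
                    }]
                }
            },
            "sort": {"_score": "desc"},
            "fields": ["id"],
            "size": 10,
            "_source": False,
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(search))
    # The msearch body must end with a newline.
    return "\n".join(lines) + "\n"


# Example: a batch of 50 dummy 128-dimension vectors, as in one msearch call.
batch = [[0.01 * i] * 128 for i in range(50)]
body = build_msearch_body(batch)
print(len(body.splitlines()))
```

Each batch of 50 queries therefore produces a 100-line body, sent as a single request.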
As you can imagine, searches take a super long time. I did some estimates and searching 1M rows takes around 2 hours. So, are there things I can change to improve the search performance?
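To put that estimate in perspective, here is a rough throughput calculation in Python. It uses only the numbers above, and the per-msearch latency at the end additionally assumes that each of the 40 Spark tasks sends one msearch at a time and spends essentially all of its time waiting on Elasticsearch, which I have not verified.

```python
# Figures from the description above.
rows_searched = 1_000_000
elapsed_hours = 2
total_rows = 150_000_000
batch_size = 50          # queries per msearch
spark_tasks = 8 * 5      # 8 tasks per node x 5 nodes

# Overall query throughput observed so far.
queries_per_second = rows_searched / (elapsed_hours * 3600)

# Time to search the whole dataset at the same rate.
total_hours = total_rows / rows_searched * elapsed_hours

# msearch requests completed per second across all tasks.
msearch_per_second = queries_per_second / batch_size

# Assumption: each task issues msearch requests sequentially, so the
# implied wall-clock time per msearch is tasks / aggregate request rate.
seconds_per_msearch = spark_tasks / msearch_per_second

print(f"{queries_per_second:.0f} queries/s")
print(f"{total_hours:.0f} hours for the full dataset")
print(f"{msearch_per_second:.2f} msearch requests/s")
print(f"{seconds_per_msearch:.1f} s per msearch (under the assumption above)")
```

In other words, at the current rate the full dataset would take about 300 hours, or roughly 12.5 days.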
Thanks,
Aldo