Hi community,
I'm looking for ideas on how to improve the performance of vector searches on Elasticsearch v8.15.0.
The index I'm running searches on contains 150M documents (they are only loaded once, and never refreshed afterwards), with just 3 fields: id, name, and embeddings (a 128-dimension vector). The Elasticsearch deployment (ECK) is running on a Kubernetes cluster (GKE) and has 40 data nodes with 30 CPUs and 220 GB of memory each, plus 2 masters and 2 coordinators. The index is configured with 40 primary shards (1 per data node, to speed up initial indexing) and 20 replicas each. The index stats look as follows:
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   myindex _F_ZALb-TgeO9CzXD101Hw  40  20  149410256            0     14.4tb        702.2gb      702.2gb
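For context, here is a rough back-of-the-envelope calculation of what those stats mean per data node. It is only a sketch: it assumes the shard copies are spread evenly across the 40 data nodes and that the vectors use the default float element type (4 bytes per dimension), and it ignores the HNSW graph and the other fields.

```python
# Figures copied from the index stats and cluster description above.
PRIMARY_SHARDS = 40
REPLICAS = 20
DATA_NODES = 40
PRI_STORE_GB = 702.2
DOC_COUNT = 149_410_256
DIMS = 128
BYTES_PER_DIM = 4  # assumption: default float element_type

# Each primary shard exists once plus once per replica.
copies = 1 + REPLICAS

# Assumption: shard copies are balanced evenly across data nodes.
shard_copies_per_node = PRIMARY_SHARDS * copies / DATA_NODES

# On-disk size of all copies, and the share held by one node.
total_store_tb = PRI_STORE_GB * copies / 1024
store_per_node_gb = PRI_STORE_GB * copies / DATA_NODES

# Raw vector data only (no HNSW graph, no other fields), in binary GB.
raw_vectors_gb = DOC_COUNT * DIMS * BYTES_PER_DIM / 1024**3
raw_vectors_per_node_gb = raw_vectors_gb * copies / DATA_NODES

print(f"shard copies per node:     {shard_copies_per_node:.0f}")
print(f"total store size:          {total_store_tb:.1f} TB")
print(f"store size per node:       {store_per_node_gb:.0f} GB")
print(f"raw vectors per node:      {raw_vectors_per_node_gb:.0f} GB")
```

Under those assumptions each data node holds 21 shard copies, roughly 369 GB of index data on disk (against 220 GB of memory), of which about 37 GB is raw vector data.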
The mapping of the embeddings field looks like this:
{
  "type": "dense_vector",
  "dims": 128,
  "index": true,
  "similarity": "cosine",
  "index_options": {
    "type": "hnsw",
    "ef_construction": 150,
    "m": 24
  }
}
Now, I have another dataset, also 150M rows, that I'm reading from BigQuery using Apache Spark on 5 nodes with 8 cores each. I iterate over each partition and send multi-search (msearch) requests with a batch of 50 queries each (8 tasks per node x 5 nodes x 50 queries per msearch = 2000 concurrent searches).
I'm using the following query:
{
  "query": {
    "bool": {
      "should": [{
        "knn": {
          "field": "embeddings",
          "query_vector": [1.0, 0.54, 0.01, 1.5, ...],
          "k": 10,
          "num_candidates": 100
        }
      }]
    }
  },
  "sort": {"_score": "desc"},
  "fields": ["id"],
  "size": 10,
  "_source": false
}
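To show how the batches are put together, here is a minimal sketch of serializing a batch of query vectors into a newline-delimited msearch body. It is not my actual Spark code: the function name build_msearch_body and the example vectors are hypothetical, and it assumes plain Python with only the standard library.

```python
import json


def build_msearch_body(index, query_vectors, k=10, num_candidates=100):
    """Serialize one kNN search per vector into an NDJSON msearch body.

    Hypothetical helper for illustration only. Each search is a pair of
    lines: a header naming the target index, then the search body.
    """
    lines = []
    for vector in query_vectors:
        header = {"index": index}
        body = {
            "query": {
                "bool": {
                    "should": [{
                        "knn": {
                            "field": "embeddings",
                            "query_vector": vector,
                            "k": k,
                            "num_candidates": num_candidates,
                        }
                    }]
                }
            },
            "sort": {"_score": "desc"},
            "fields": ["id"],
            "size": k,
            "_source": False,
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The msearch body must end with a newline.
    return "\n".join(lines) + "\n"


# Example with two made-up 128-dimension vectors (real batches hold 50).
example_body = build_msearch_body("myindex", [[0.1] * 128, [0.2] * 128])
```

The resulting string is what gets posted to the _msearch endpoint, one request per batch.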
As you can imagine, the searches take a very long time: by my estimate, processing 1M query rows takes around 2 hours. What can I change to improve search performance?
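To put that estimate in perspective, here is the same arithmetic extrapolated to the full dataset. It is a rough sketch that assumes throughput stays constant at the rate I measured.

```python
# Measured estimate from above: about 1M query rows in about 2 hours.
QUERIES_MEASURED = 1_000_000
HOURS_MEASURED = 2
TOTAL_QUERIES = 150_000_000  # size of the dataset read from BigQuery

# Assumption: throughput stays constant for the whole run.
queries_per_second = QUERIES_MEASURED / (HOURS_MEASURED * 3600)
total_hours = TOTAL_QUERIES / QUERIES_MEASURED * HOURS_MEASURED
total_days = total_hours / 24

print(f"throughput: {queries_per_second:.0f} queries/s")
print(f"full run:   {total_hours:.0f} hours ({total_days:.1f} days)")
```

That works out to roughly 139 queries per second, or about 300 hours (12.5 days) for all 150M rows.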
Thanks,
Aldo