ANN Search Timeouts

We are trying to use the ANN (Approximate Nearest Neighbor) feature of Elasticsearch 8.0 (k-nearest neighbor (kNN) search | Elasticsearch Guide [8.1] | Elastic).

At the moment we have indexed 16 million documents:
Index a: ~6Mio
Index b: ~10Mio

We created the indices using the following mapping:

	{
		"mappings": {
			"properties": {
				"vector": {
					"type": "dense_vector",
					"dims": 768,
					"index": true,
					"similarity": "l2_norm"					
				}
			}
		}
	}

And we use the following query to retrieve the ANN results:

    POST a/_knn_search
    {
      "knn": {
        "field": "vector",
        "query_vector": [
          0.5619577,
          -1.7599238,
          ...
        ],
        "k": 100,
        "num_candidates": 1000
      },
      "_source": ["id", "documentParts.title"]
    }

The setup is in a cloud environment where we currently have

  • 8 VCPUs
  • 128GBs of RAM
  • 2TB of SSD storage

On the virtual machine I have set up Kibana and ES using docker-compose (GitHub - deviantony/docker-elk: The Elastic stack (ELK) powered by Docker and Compose),

including the following environment settings:

    environment:
      - "ES_JAVA_OPTS=-Xmx64g -Xms64g"

This setup worked quite well and we were happy with the response times (1-6 seconds for k=50 ANN searches).

So now we tried to do the same with cosine instead of l2_norm.

	{
		"mappings": {
			"properties": {
				"vector": {
					"type": "dense_vector",
					"dims": 768,
					"index": true,
					"similarity": "cosine"					
				}
			}
		}
	}

We reindexed all the data on the same machine, and now we have 32 Mio docs in total:
Index a: ~6Mio
Index b: ~10Mio
Index a_cos: ~6Mio
Index b_cos: ~10Mio

Now we constantly get timeouts (Error 504, Gateway Timeout) for the requests that worked perfectly before.

Why is it not working anymore?
What changed?
How can I debug this?
Is there a potential solution?

Thanks a lot

Searches are very fast if all the data structures needed for _knn_search are already built and available. So if you index all your data, stop making index updates, force merge to a single segment, wait for the force merge to finish, and then run your searches, you will get the best search performance.
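
For example, a force merge down to a single segment could be triggered like this (index a is just an illustration; on indices of this size the merge can take a long time, and the segment count can be checked afterwards with the _cat/segments API):

    POST a/_forcemerge?max_num_segments=1

    GET _cat/segments/a?v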

What is slow is indexing, as building the HNSW graphs required for _knn_search is an expensive operation. So if you have concurrent indexing and search operations, Elasticsearch will periodically (by default every second) trigger a refresh operation that creates a new segment and builds a new HNSW graph for that segment to make newly indexed data available for search. Some search operations will wait for these refreshes to finish and can time out. Also, the more segments are created, the slower searches become, as it is faster to search one big HNSW graph than many small ones.
The best approach is to separate searches from indexing. Also, if you are not concerned with making newly indexed data immediately available for search, you can increase refresh_interval.
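
For example, something along these lines would reduce refresh pressure during bulk indexing (30s is just an arbitrary value; "-1" disables automatic refreshes entirely):

    PUT a/_settings
    {
      "index": {
        "refresh_interval": "30s"
      }
    }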

Hello Mayya,
thanks for your reply.

So, as you proposed, I have set the following value in the index settings:

  "index.refresh_interval": "-1",

and executed the force merge request on the indices.

There is no improvement.

Do I need to change my setup to speed it up?
For example, should we have more but smaller instances?

Thanks a lot for your help.

Do you do index updates at the same time as searches?
Are the timeouts you are getting for search or index requests?

No additional data is being indexed. The kNN/ANN searches are performed after the 16 Mio documents were added.

The timeouts come from the backend/middleware; its timeout is currently set to 2 minutes.
I did a curl request directly on the machine where Elasticsearch runs.
It takes 6.25 minutes to return the result.
The request was executed on the b_cos index and had the following params:

"k": 100,
"num_candidates": 1000

6.25 minutes is very slow and it should not be that slow.
We have done an experiment with 10 million docs (although with much smaller vectors: 96 dimensions versus your 768), and knn-search-100-1000 (k: 100, candidates: 1000) takes 11 ms. And this was done on a very modest machine (8 GB of heap).

Can you try the following:

  • leave it to Elasticsearch to automatically set the JVM heap size, or at least cap it at 30 GB; your 64 GB is too high and doesn't leave much space for the filesystem cache. Elasticsearch doesn't need that much Java heap memory; a lot of its data files are memory-mapped.
  • disable the source in your query ("_source": false) and run the query again, as in the example below. Make sure to run queries multiple times to get an average run time.
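
For example, a query along these lines (query vector elided) is what I mean by disabling the source:

    POST b_cos/_knn_search
    {
      "knn": {
        "field": "vector",
        "query_vector": [ ... ],
        "k": 100,
        "num_candidates": 1000
      },
      "_source": false
    }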

Hi Mayya, thanks again for the reply.

So I set the heap size to 24 GB.

I then sent my previous requests to Elasticsearch again.
It felt like there was no change.

But after that, I wrote a benchmark script.

  • 100 requests
  • Randomly generated vectors
  • 2 randomly selected indices a_cos, b_cos
  • k = 100 neighbours
  • num_candidates = 1000

Results:
with "_source": false.

AVG: 1.449s
MIN: 0.751s
MAX: 2.403s

with "_source": ["title"]

AVG: 1.609s
MIN: 0.803s
MAX: 2.559s

So it seems like the reduction was helpful.

I will keep an eye on it and keep you posted.

But thanks a lot :slight_smile:


Adding one other idea, since you mentioned that searches may have become slower once you switched the similarity from l2_norm to cosine. The cosine similarity is convenient for testing and development, but can be slower to compute than the other types. For best performance, we recommend normalizing all the vectors in advance to have length 1 and using dot_product instead. These docs have more information under the similarity section: Dense vector field type | Elasticsearch Guide [8.1] | Elastic.
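
For reference, a mapping with dot_product would look roughly like the one below; the vectors themselves then need to be normalized to unit length (divided by their L2 norm) before indexing and before querying:

    {
      "mappings": {
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 768,
            "index": true,
            "similarity": "dot_product"
          }
        }
      }
    }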


Some feedback after I added a new index c_cos with an additional ~3 Mio documents, including vectors.

Same mapping
Same settings

DB in total:
Index a_cos: ~6Mio
Index b_cos: ~10Mio
Index c_cos: ~3Mio

Same test setup.
Strange behavior:

k=100
num_candidates=1000

a_cos:  1s
b_cos:  30s
c_cos: 2 minutes

I aborted the test.

But after setting "_source": false and re-running the test (100 requests for each index):

a_cos:
AVG: 1.41s
MIN: 0.86s
MAX: 2.07s

b_cos:
AVG: 2.32s
MIN: 1.63s
MAX: 3.40s

c_cos:
AVG: 1.14s
MIN: 0.69s
MAX: 1.72s

So it somehow seems to be the case that one has to run a few requests with "_source": false before one can run them with specific source fields.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.