Elasticsearch high latency

Hey guys,

We have been using Elasticsearch 1.7.4 for over 3 years, and we just upgraded to version 6.1.1. We have completed our data backfill and started testing our queries, but unfortunately we are seeing really high latency for very simple terms queries.

Here is some data about our cluster:
We have around 90 billion docs in a single index, with an index size of 36 TB and 2 replicas. The full cluster size including the 2 replicas is around 100 TB.
We index around 5K docs/second.
We want to run around 7-10K queries/second.
We have 400 hosts in the fleet; earlier this was 350, and we just increased it to 400 to see if horizontal scaling helps.
We have allocated 30 GB of heap to Elasticsearch and left the remaining 30 GB for the filesystem cache.
We are using CMS as our GC.
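
For reference, the relevant jvm.options lines look roughly like this (a sketch of our intended settings; the GC flags are the 6.x defaults):

# jvm.options (sketch) - 30 GB heap, CMS collector
-Xms30g
-Xmx30g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly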

Here is what we are seeing: our indexing performs really well with no queries.
We run some msearch requests that are purely terms queries with a sum aggregation (along the lines of the sketch below), and even those return in 250-1000 ms.
But the moment we add our regular terms queries, latency spikes to anywhere from 2-10 seconds, throughput drops, and we see a lot of old-generation GC kick in.
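
For illustration, the msearch bodies are roughly like this; fieldA is one of our real fields, but the values and the aggregated field are placeholders:

GET /our_index/_msearch
{}
{"size":0,"query":{"terms":{"fieldA":["valueA"]}},"aggs":{"total":{"sum":{"field":"some_numeric_field"}}}}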

We do not see any queries being cached; our query cache size is a few hundred MB. Once we add queries, our heap jumps from 7 GB to around 25 GB and starts causing long GC pauses that hurt our latency.

We have a near-real-time use case, hence our refresh interval is 1s. When we increase the refresh interval to 30s, we see an improvement in latency as well.
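
For the record, the refresh interval change is just a dynamic settings update (index name is a placeholder):

PUT /our_index/_settings
{"index":{"refresh_interval":"30s"}}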

Can you guys point us to any specific metrics we should be looking at?

Some data from a host

The host is consuming around 28 GB of heap:
1- there are only 239 segments, consuming 800 MB;
2- the query cache holds 200 MB;
3- the translog is consuming 400 MB.
We did not find any other indicators of what is consuming the heap.
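
For reference, the kind of endpoints that expose these numbers (index name is a placeholder):

GET /_nodes/stats/jvm
GET /our_index/_stats/segments,query_cache,translog
GET /_cat/segments/our_index?v

The first shows heap usage and GC counts/times per node, the second shows segment memory, query cache and translog sizes for the index, and the third lists per-segment sizes.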

Are there any known memory leaks in version 6.1.1?

What is the difference between these two term queries other than the msearch and sum agg?

Size is zero in the msearch query.
Size is non-zero in the high latency query.

In the high-latency query, there can at times also be a match query doing full-text search.

But apart from that, they perform the same types of terms queries on the same set of fields.

The retrieval of source and highlighting for a few docs can often be slower than the search through billions of docs used to find them. A search on a rare term involves few operations and does not scale linearly with the number of docs searched, whereas the retrieval/markup of doc content can be more expensive and is linear in the number of docs being returned.

We do not fetch the source, nor do any highlighting. Our queries are something like this:

routing = X

For X, find everything with field value Y and field value Z. So we limit the number of docs to search and hit only a limited number of shards: at most 5 shards, and matching docs can be anywhere from 500 to 3000.

I'm not sure which "size" you're referring to then - I'd assumed it was the number of hits you want back?
Even if you are only interested in the "metadata" fields like ID and type, that still involves a document retrieval. Assuming a 5ms random disk seek, 3000 retrievals can add up (3000 × 5ms is 15 seconds in the fully uncached worst case).
I think it would help to see some raw JSON - one for the slow query and one for the fast.

Also - are you saying the performance of the same queries on the same data and the same hardware setup has altered significantly between these two versions?

Yes, it is the number of hits we want back.

Here is a sample slow query which, no matter when we add it, always gives bad performance:

took[2.6s], took_millis[2670], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[5]
{"size":1,"timeout":"30s","query":{"constant_score":{"filter":{"bool":{"must":[{"term":{"fieldA":{"value":"valueA","boost":1.0}}},{"bool":{"should":[{"terms":{"fieldB":["valueB"],"boost":1.0}},{"term":{"fieldC":{"value":valueC,"boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},{"terms":{"fieldD":[0],"boost":1.0}},{"terms":{"fieldE":[1,0,3,2],"boost":1.0}},{"term":{"fieldF":{"value":"valueF","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"boost":1.0}},"_source":false}

The same query with size 0 plus aggs (getting an example of this will take time, as it has sensitive information) returns in a few milliseconds when we do not have any of the above queries hitting the cluster.

This sample was taken when the latency was 2 seconds, but it rises up to 10 seconds as well, and during that time the heap on the node is around 90%.

Yes, same queries, same hardware, same dataset. Everything is almost the same, except the version of Elasticsearch is 6.1.1.

So is the sample slow query JSON representative? It has size:1

Yes, the same queries run with different size params. The number of docs they match ranges from 500 to 3000.

For these exact-term lookups, are any of these values mapped as numeric data types?

If so, you may be hitting a problem with the way the data is indexed for use. The assumption is that numeric values are always used for representing quantities of things, e.g. prices, sizes, weights, areas etc., and are therefore indexed to support range queries efficiently rather than direct lookups on unique values. A direct lookup on an exact integer field value actually has to run as a range query behind the scenes.

If you have numeric fields used only for direct lookups, e.g. as a form of unique ID, they should be indexed as keyword fields for efficient lookup.
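
As a sketch, a field only ever used for exact matches could be mapped as keyword at index creation, while genuinely quantitative fields stay numeric (index, type and field names here are just examples):

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "fieldD": { "type": "keyword" },
        "some_quantity": { "type": "integer" }
      }
    }
  }
}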

Some of the fields there are indexed as shorts/booleans; we were under the impression that shorts are fast for lookup.
We can try changing that and run the test again, but I am still not sure whether that should cause such high GC and performance issues. There are only about four such fields.

Hey

I got a heap dump of one of the hosts with high heap and it shows 13 GB of unreachable objects; is that expected? Of that, 11 GB is values stored in long fields.

Another update here is that our slow queries use sorting.

Hey guys

I just got a heap dump of a node which was consuming around 30 GB of heap. Of the 30 GB, 12 GB is consumed by long values, which are referenced by org.apache.lucene.util.FixedBitSet, which is in turn referenced by FilterLeafReader.

Is this something you have seen before?

Update -

When we reduced our scroll time from 10 minutes to 10 seconds, our latency dropped down to the 200 ms range. Is there some setting we are missing in order to hold large open contexts on scroll?
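
For context, the change amounts to shortening the keep-alive passed on each scroll request, and explicitly clearing finished scrolls also releases the context early (index, field and value names are placeholders):

POST /our_index/_search?scroll=10s
{"size":1000,"query":{"term":{"fieldA":"valueA"}}}

POST /_search/scroll
{"scroll":"10s","scroll_id":"<scroll_id from the previous response>"}

DELETE /_search/scroll
{"scroll_id":["<scroll_id>"]}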

Another update -

We use custom routing, and data for a given routing parameter is distributed across 5 shards; we are using the index.routing_partition_size param and setting it to 5. So is it possible that the heap is growing faster on too many hosts because each of these shards now has to keep a context per request for a given query?
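
For clarity, that setting is applied at index creation time, roughly like this (the shard count here is illustrative, not our real one; partitioned routing also requires _routing to be mandatory):

PUT /our_index
{
  "settings": {
    "index.number_of_shards": 100,
    "index.number_of_replicas": 2,
    "index.routing_partition_size": 5
  },
  "mappings": {
    "doc": {
      "_routing": { "required": true }
    }
  }
}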

The scroll API can be a resource hog if you have many in play. Each scroll preserves a point-in-time view of the set of files that were current at the time the scroll was started. These locks are expensive because the files undergo constant change during indexing, and the same document can be reorganised multiple times as part of the file-merging process. "Normal" search quickly forgets about older file organisations, which can be thrown away by background compaction tasks. Maintaining scroll contexts means we have to hang on to outdated files and related caches to preserve these point-in-time views.
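
A quick way to see how many search/scroll contexts each node is holding open is the node search stats; the open_contexts and scroll_current fields are the ones to watch:

GET /_nodes/stats/indices/search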

We have reverted the change that distributed the information across 5 shards; I'll update this thread once we have a successful test.

After reverting the change we had a very successful test; our queries performed really well. Thanks @Mark_Harwood for all your help.

Hi Arun

Did you mean that in upgrading to 6.x the number of shards was also increased?

I am investigating a production incident on our ES cluster. I believed that having more shards caused GC pressure and took down the cluster. However, in a test I added a lot of shards (indices without data) and the ES cluster did not go down; note that there was no data coming in and no queries running against the dummy indices.

Your inputs might help me understand ES behaviour better.
