We have been using Elasticsearch 1.7.4 for over 3 years and just upgraded to version 6.1.1. We have completed our data backfill and started testing our queries, but unfortunately we are seeing very high latency for even simple terms queries.
Here is some data about our cluster:
We have around 90 billion docs in a single index; the index size is 36TB. With 2 replicas, the full cluster size is around 100TB.
We index around 5K docs/second.
We want to run around 7-10K queries/second.
We have 400 hosts in the fleet; this was 350 earlier, and we just increased it to 400 to see if horizontal scaling helps.
We have allocated 30GB of heap to Elasticsearch and left the remaining 30GB for the filesystem cache.
We are using CMS as our GC.
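For completeness, the relevant part of our jvm.options looks roughly like this (the heap size is ours; the CMS flags shown are the stock 6.x defaults):

```
# jvm.options excerpt: heap sized to half of the 60GB machine,
# CMS collector with the default occupancy settings
-Xms30g
-Xmx30g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
```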
Here is what we are seeing: indexing performs really well when no queries are running.
We run some msearch requests that are purely terms queries with a sum aggregation, and even those return in 250-1000ms.
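To illustrate the shape of these msearch requests (the index, field names, and values here are placeholders, not our actual mapping):

```json
GET /our-index/_msearch
{}
{"size":0,"query":{"terms":{"customer_id":["c1","c2","c3"]}},"aggs":{"total_amount":{"sum":{"field":"amount"}}}}
```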
But the moment we add our regular terms queries, latency spikes to anywhere from 2-10s, throughput drops, and we see a lot of old-generation GC kick in.
We do not see any queries being cached; our query cache size is only a few hundred MB. Once we add queries, heap usage jumps from 7GB to around 25GB, which causes long GC pauses that impact our latency.
We have a near-real-time use case, hence our refresh interval is 1s. When we relax the refresh interval to 30s, we see an improvement in latency as well.
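This is how we change the refresh interval for the test (a standard dynamic index settings update; "our-index" is a placeholder):

```json
PUT /our-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```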
Can you point us to any specific metrics we should be looking at?
Some data from one host, which is consuming around 28GB of heap:
1- There are only 239 segments, consuming 800MB.
2- The query cache is using 200MB.
3- The translog is consuming 400MB.
We did not find any other indicators of what is consuming the heap.
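For reference, this is the kind of node stats call we used to get those per-host numbers, filtered down to the memory-related sections (the filter_path is just one way to slice it):

```json
GET /_nodes/stats/jvm,indices?filter_path=nodes.*.jvm.mem,nodes.*.indices.segments,nodes.*.indices.query_cache,nodes.*.indices.fielddata
```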
Are there any known memory leaks in version 6.1.1?