We are running an Elasticsearch cluster in production. Here's the cluster configuration:
Data Nodes
Total 54
i2.2xlarge EC2 instances
8 CPU cores
61 GB memory (30 GB allocated to ES)
2 x 800 GB SSD storage
Client Nodes
Total 6
r3.large EC2 instances
2 CPU cores
15 GB memory (12 GB allocated to ES)
1 x 32 GB SSD storage
Master Nodes
Total 3
Index Information
We have two indices on this cluster:
Index 1 (size is for primary shards only, i.e. without replication)
size 2.2 TB
docs 653M
shards - 72 primary (2x replica)
Index 2
size 333 GB
docs 88M
shards - 16 primary (2x replica)
shards per node - 5
Avg indexing rate 5K/sec
Avg query rate 4K/sec
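For reference, the layout above can be double-checked with the _cat APIs; something like this (host and index names are placeholders matching Index 1 and Index 2 above, adjust for your setup):

# primary/replica counts, doc counts and on-disk size per index
curl -s 'localhost:9200/_cat/indices/index1,index2?v&h=index,pri,rep,docs.count,store.size'

# how the shards of one index are spread across the data nodes
curl -s 'localhost:9200/_cat/shards/index1?v&h=index,shard,prirep,state,docs,store,node'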
Most of our queries are served in under 500 ms, but we do see huge latencies of ~4-10 seconds multiple times a day. We looked into these queries and they were no different from the other low-latency queries; the same queries take less than 500 ms when we retry them with curl.
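As an illustration, a single query can be timed from the command line with curl's -w timing variables (the index name and query body here are placeholders, not our real query):

# time one search end-to-end; time_total is the full round trip in seconds
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  -H 'Content-Type: application/json' \
  -XPOST 'localhost:9200/index1/_search' \
  -d '{"query": {"match": {"title": "example"}}}'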
There is no related CPU spike during this time. The only observation is that the search threads are all blocked, even though the search thread queue doesn't hold more than a few hundred queries.
search thread_pool size - 13
search thread_pool queue - 1000
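For reference, the live queue and rejection counts can be watched during a spike with the _cat thread pool API (the exact columns in the output differ a bit between ES versions):

# per-node thread pool stats; watch for a growing search queue or a non-zero rejected count
curl -s 'localhost:9200/_cat/thread_pool?v'

# poll it during a latency spike, e.g. every 5 seconds
watch -n 5 "curl -s 'localhost:9200/_cat/thread_pool?v'"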
I'm going to attach the hot threads output from the cluster in the next post (due to the character limit).
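For context, the hot threads output can be captured with something along these lines (default HTTP port assumed; the interval matches the 500ms sampling window visible in the output):

# sample the hottest threads on every node over a 500ms window, top 10 threads per node
curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'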
At this point we are clueless and it's heavily impacting our application. We would appreciate your help with:
understanding what the hot threads mean here?
what could be the possible root cause that we should investigate further?
42.8% (214.2ms out of 500ms) cpu usage by thread 'elasticsearch[es-268][[index2][12]: Lucene Merge Thread #166763]'
This one is a merge being IO throttled. The others are queries doing query things. Are those stack traces taken during the slow times?
It'd be useful to have graphs of things like the query cache hits over time and the disk IO over time. Without lots of shiny graphs it is hard to tell what is up. With graphs you could say "ah, these three things spike when we get slowdowns, it is caused by X."
For what it is worth, Elastic's Marvel collects those graphs. I don't know of other tools offhand that do, but I'm sure they exist. All of the APIs are open. It really helps to graph all the things when you have a situation like this.
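If you want to pull the raw numbers behind those graphs yourself, the node stats API exposes them; polling something like this on a schedule is enough to build the charts (the query cache section is named filter_cache on older 1.x releases):

# cache, filesystem and JVM stats for every node; graph the deltas between polls
curl -s 'localhost:9200/_nodes/stats/indices,fs,jvm?pretty'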
In addition, GC activity (logs or, ideally, charts) may help. With spikes that large, it could very well be some long-running GCs that cause the whole system to pause.
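As a quick first check before pulling full GC logs, the cumulative collector counters in the node stats can be compared before and after a spike; old-generation collection time jumping by whole seconds would point the same way:

# jvm.gc.collectors.*.collection_count and collection_time_in_millis per node
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'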
Wanted to follow up on whether the ES dashboard snapshot was helpful in pointing to a key cause of the high query times. Let me know if you need more metrics to dig in further.
I looked and couldn't really see anything. I wonder, if you turn on the slowlog, could you get a look at the slow queries? If you retry them on your own, are they still slow? Like, are the slow ones special or are they all normal?
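For reference, the search slowlog thresholds are dynamic index settings, so they can be lowered temporarily without a restart. A sketch, with the index name and thresholds as placeholders (pick values below your normal 500ms so only the outliers get logged):

# log the query and fetch phases of anything slower than 1s at WARN on this index
curl -s -XPUT -H 'Content-Type: application/json' 'localhost:9200/index1/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'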
If you look at the second dashboard and squint a bit when comparing client node heap size and query latency, you can often spot that a major GC is occurring at the same time your query latency rises (it does not always seem to be the case, though, so you should also keep an eye on your data nodes).
I'd enable the GC log and start correlating the timestamps of GCs with the times when you experience latency spikes. After that you should probably start carefully tuning your GC settings.
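A sketch of what enabling the GC log can look like on a Java 8 JVM; the log path and the way the options are passed are assumptions that depend on how you start Elasticsearch (ES_JAVA_OPTS is one common route):

# print every collection with a timestamp plus the stop-the-world pause time
export ES_JAVA_OPTS="-Xloggc:/var/log/elasticsearch/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime"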
When ES logs a query in the slow-query log, does the reported latency include the time the query spent in the queue? I'm wondering if this is just a queuing problem when ES is under load.
Thanks for the suggestion, Nik. I looked through the logs and do not see any slowlog entries. We have the slowlog enabled on our production cluster with a threshold of 2s.