Last week I upgraded my ELK stack from 6.4 to 6.5.1. Since then, performance has been declining.
Elasticsearch is running on a single server; the cluster consists of one node. I have 3 Beats (Winlogbeat, Metricbeat, Filebeat) running, with Metricbeat and Filebeat sending data directly to Elasticsearch, and Winlogbeat sending data to Logstash.
My CPU usage jumped considerably after the upgrade. My server has 2 cores and CPU was consistently around 103% (it can go up to 200% because of the 2 cores).
I followed some guides and first started looking at my hot threads. I ran GET /_nodes/hot_threads on the Dev Tools tab. The results coming back didn't indicate high CPU, unless the percentage is per thread; for example, would 8% CPU on 10 threads mean 80% CPU? One common theme of these thread dumps was the presence of "recoverFromTranslog" in many of the stacks. Why is it continuously trying to recover the translog? I wasn't aware of any corruption.
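For reference, the hot threads API also takes a few optional parameters; as far as I know these are just the documented defaults spelled out:

GET /_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true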
I also looked at the Elasticsearch logs in the journal. There I discovered many messages indicating issues with garbage collection. For example: [2018-12-10T13:50:20,527][WARN ][o.e.m.j.JvmGcMonitorService] [joJDebD] [gc][101646] overhead, spent [12.5s] collecting in the last [12.6s].
While typing up this message, Elasticsearch crashed completely with java.lang.OutOfMemoryError: Java heap space.
I feel like the knee-jerk reaction should be to throw more hardware at this, starting with memory allocation, but I am concerned that my stack went from really smooth to unusable with the 6.5.1 upgrade. During the upgrade I had an issue where I needed to delete a Kibana index. Perhaps it's continuously trying to repair some index somewhere and that is breaking everything else?
Thanks David, I will get that data later tonight. One thing I noticed is that Filebeat seems to push it over the edge. With the Filebeat service running, all queries in Discover time out. When I stop the service, I can query again. I am querying a different index than Filebeat's each time. Filebeat is running the IIS module (the geo location and user agent processors come to mind as areas to investigate). Of course, it could just be the straw that broke the camel's back.
Hi David, sorry for the delay. I have been having problems creating that heap dump you asked for. The version of Elasticsearch I upgraded from was 6.4.1.
I am using jmap for the dump. Elasticsearch doesn't respond to it, so I attempt to force it, but with -F my JRE and JDK versions have to match exactly and that is proving problematic. I will keep trying to create the dump.
Last night, Elasticsearch crashed again. The journal indicates it ran out of heap space. Several hours before this crash, I increased the heap size in jvm.options, uncommenting and presumably overriding the defaults. There is 16GB of RAM on this server. I left the Beats in the same state last night as before the upgrade, with just a single host sending data via Winlogbeat, Metricbeat, and Filebeat.
-Xms8g
-Xmx8g
jmap -dump:format=b,file=/home/devops_mike/elasti-dump 6545
6545: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
Oh yes, I forgot that Elasticsearch sets that by default. You could try opening the heap dump locally and seeing if it looks similar to this one - lots of instances of OptOutQueryCache taking up all the space.
However, 6.5.3 is now released, so rather than wading into a heap dump I think it'd be better to upgrade and see whether the problem I guessed at has been resolved.
Hi David, I upgraded to version 6.5.3 as it seemed my issue might be similar to what was reported. This did not improve the situation. I am still getting warnings that a majority of time is spent in garbage collection, and heap space still runs out. This morning Elasticsearch only stayed up a couple of minutes before crashing. I only have 1 host sending data, and that is just a staging server that gets very little traffic.
I used apt to install the 6.5.3 upgrade and went in the order Logstash, Elasticsearch, Kibana while all those services were stopped.
I have gone through the heap dumps using jhat, but I must admit my skills are limited there. I have experience with Debug Diagnostics and .NET code, but not jhat/Java.
I am currently considering rebuilding the stack completely without attempting to preserve the indices. I'm not sure whether I want to try 6.5.x again or just go back to 6.4.x since that was working great.
Can I export all my dashboards and visualizations and then import into 6.4.X since the stack is now 6.5.3?
I am also open to any suggestions to try and fix the current stack.
Downgrades are tricky and don't work in general, unfortunately, so unless you can roll back to snapshots that you took before the upgrade the best path is forwards. I'd like to try looking at the heap dumps you've collected. Do you have something like an S3 bucket that you can use to share them? You can mail me a link at david.turner@elastic.co rather than sharing them in public if you'd prefer.
Thanks David, I am working with my team to figure out how to share these with you. We don't have an S3 bucket. Also, we agree with you that it's best to figure out what's going on with the current setup rather than trying to downgrade. It will certainly give us experience troubleshooting!
Is there an FTP server I could upload to? I am also looking into setting up a Dropbox.
File is uploading.
java_pid32408.hprof sha256 = 34FC231CF5E5FB3493EA14829F0D3C350909026EB718E53D13197A7790B67567
This dump is from after the initial upgrade to 6.5.1 and was created automatically on the heap-space error. I am working on getting a second one from after the 6.5.3 upgrade.
Thanks, got it. This node seems to have a 1GB heap and 993 shards. Quoting this article:
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured.
So the recommendation is to limit to ~20 shards on such a small heap, about 50x fewer than what I see. What I don't understand now is how this was working for you back in 6.4.1. Was it much more lightly loaded back then?
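In case it helps, you can check these numbers yourself with something along these lines (the first returns one row per shard, the second shows the heap columns):

GET _cat/shards?v

GET _cat/nodes?v&h=name,heap.max,heap.percent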
Whoa, I was at 993 shards when the recommendation is 20. I have since allocated 8GB, but that still only raises the recommendation to 160 vs my 993.
When I was on 6.4.1 I had just the one host reporting, sending Filebeat, Winlogbeat, and Metricbeat data. After the upgrade I briefly had 2 more hosts sending data. Once Elasticsearch started having issues, I disabled the Beats services on all but the one server.
It sounds like I should be investigating why we have such a high number of shards.
Really appreciate the help on this and your investigation into our heap dump.
If you have a lot of small shards, reduce the number of primary shards per index and consider switching to monthly indices instead of daily. You will need to reindex in order to consolidate existing indices.
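For the new indices you can enforce the shard and replica counts with an index template, roughly like the sketch below. The template name and patterns are only examples, and the order may need to be higher than the order of the Beats' own templates so that these settings win:

PUT _template/beats-single-shard
{
  "index_patterns": ["filebeat-*", "winlogbeat-*", "metricbeat-*"],
  "order": 2,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}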
Thanks Christian, this sounds like a good idea. I am using the default index name. To set this to monthly, do I just change the index setting in my filebeat.yml file, something like the sketch below?
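This is just my guess, dropping the day from the default pattern; I gather the setup.template lines are also needed whenever the index name is customized:

output.elasticsearch:
  index: "filebeat-%{[beat.version]}-%{+yyyy.MM}"

setup.template.name: "filebeat-%{[beat.version]}"
setup.template.pattern: "filebeat-%{[beat.version]}-*"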
The Reindex API makes no effort to handle ID collisions so the last document written will "win" but the order isn’t usually predictable so it is not a good idea to rely on this behavior. Instead, make sure that IDs are unique using a script.
Got them. Christian's suggestion of monthly indices sounds like a good idea.
I would suggest starting by deleting or temporarily closing as many old indices as you can. There are winlogbeat indices from February and March, and closing all of October and November's daily indices would also help reduce the heap pressure a lot while still giving you access to the last two weeks of data.
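Closing them is one request per batch, along these lines. I'm guessing at the exact naming pattern here, so adjust it to whatever GET _cat/indices shows:

POST /winlogbeat-*-2018.02.*/_close

POST /winlogbeat-*-2018.03.*/_close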
Then move your inbound data to start using monthly indices with one shard and no replicas. Most of the indices I see are <1GB in size, so a whole month's worth of data will be 30GB or less which is a reasonable size for a single shard. They currently have 3 shards and 1 replica, but you seem to be running a single node so there's no point in asking for a replica (you can change this back later dynamically if needed) and 3 shards is too many in all the cases I can see. Keep an eye on the shard sizes as your usage grows, since there are a few days where your filebeat indices exceeded 1GB.
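Dropping the replicas on the existing indices is a dynamic settings change, so something like this takes effect straight away:

PUT /filebeat-*,winlogbeat-*,metricbeat-*/_settings
{
  "index.number_of_replicas": 0
}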
At this point your cluster should be running more smoothly, and only then should you start to think about reindexing. Do this one month at a time, reopening one batch of daily indices and reindexing them into a single monthly index. The simplest way to check for ID collisions here is going to be to perform the reindex and then check that the total number of documents in the resulting index is equal to the sum of those in the source indices. If it is, there were no ID collisions. I don't expect many collisions since much of this data is from beats which uses a uniquely generated ID for each document, but I don't know if this is true for all the indices.
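As a sketch of one month's consolidation, with illustrative index names that you should replace with your real ones:

POST _reindex
{
  "source": { "index": "filebeat-*-2018.11.*" },
  "dest": { "index": "filebeat-2018.11" }
}

GET filebeat-*-2018.11.*/_count

GET filebeat-2018.11/_count

If the two counts match then there were no collisions.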
Thanks David and Christian. I followed your advice and my cluster is now very performant. The upgrade to 6.5.1 was NOT the issue. I think it was just a coincidence that Elasticsearch finally got tipped over the edge right around the time I did the upgrade.