However, the ILM policy for all my indices is stuck at the 'forcemerge' stage.
Note: GET indexname/_ilm/explain gives the same result.
It is not moving to the next stage. Is the forcemerge still pending due to an issue, or is the segment count too small?
I understand that there is currently no option to check the forcemerge status, and the Elasticsearch logs also do not have any information about the issue. Please suggest.
PS: The monitoring data for the index does show a drop in the segment count, but it does not match the segment count set in the policy.
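For reference, this is the call I am running; the explain output reports the current phase, action, and step for the index (the index name below is just a placeholder):

```
# Check which ILM phase/action/step the index is currently on.
GET my-index-000001/_ilm/explain
```

In my case this shows the index sitting at the forcemerge step.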
To answer your question: this output means that ILM has issued a force merge request and is waiting for it to complete. The issue I linked to above describes a situation where ILM can get "stuck" waiting for the force merge to complete if the merge is interrupted for some reason.
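If you want to double-check whether a force merge is actually still running on the cluster, one thing you can try is the task management API; a pending force merge should show up as a running task (this is a general suggestion, not specific to ILM):

```
# List any force merge tasks currently running, with their descriptions.
GET _tasks?actions=*forcemerge*&detailed=true
```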
After I ran the forcemerge manually yesterday, for some reason the index no longer seems to exist on my cluster today. Could it be that, since the forcemerge happened, the index was deleted by ILM?
To confirm, I have again run forcemerge on the next index. If it disappears by tomorrow, then I guess this is what is happening.
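For completeness, this is the kind of call I am running manually (the index name is a placeholder):

```
# Manually force merge the index down to a single segment per shard.
POST my-index-000002/_forcemerge?max_num_segments=1
```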
This call will block until the merge is complete. If the HTTP connection is lost, the request will continue in the background, and any new force merge requests will block until the previous force merge is complete.
Force merge can take quite a while, so it's possible that something timed out while waiting for the force merge to complete and caused that error to pop up, but it eventually did complete in the background.
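One way to check whether the merge did eventually complete is to look at the per-shard segment counts, for example with the cat segments API (index name is a placeholder):

```
# After a successful force merge with max_num_segments=1 you would
# expect to see one segment per shard here.
GET _cat/segments/my-index-000001?v
```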
Is this happening to every index, or just some of them? I would be surprised if this is happening to every index, and that could indicate something unusual about your setup or an as-yet-unknown bug.
I believe this is what is happening. I currently have two ILM policies, one for my high-volume index and one for my low-volume index. The high-volume policy creates 6 shards and then shrinks and forcemerges them. The low-volume one has 1 shard and only does a forcemerge.
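For context, the warm phase of my high-volume policy looks roughly like this; the names, min_age, and targets below are illustrative rather than my exact values:

```
# Illustrative sketch of the high-volume policy's warm phase.
PUT _ilm/policy/high-volume-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}
```

The low-volume policy is the same minus the shrink action.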
I currently have to wait almost 2-3 hours for the larger index to complete its forcemerge. I haven't reached the forcemerge stage on my low-volume indices yet. But it is important to note that I do not have any segregation between my nodes: all my nodes are currently master+data nodes, and I have not assigned any node attributes to classify them as hot, warm, or cold.
Would classifying certain nodes as hot, warm, and cold improve the forcemerge operation speed? I know it is not recommended to perform a forcemerge on a node while data is being ingested. This could probably be the cause of the issue. Thoughts?
It might? It's difficult to make a concrete recommendation here without much more detailed knowledge of your setup.
Force merge can use a lot of system resources (CPU and IO especially), and allocating the indices to be merged to different nodes than those handling indexing would provide greater separation. It would probably make performance more predictable, by saying "these nodes are only going to handle indexing and new data, and these other nodes are only going to handle force merges and historical queries" rather than randomly having a mix of the two on any given node. However, it may or may not result in significantly faster merges on average; it's difficult to say without knowing a lot more detail about your cluster, from the hardware specifications of your servers to the indexing/query balance in your workload.
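For example, once nodes are tagged with a custom attribute (the name box_type below is arbitrary), you can move an index onto the merge/historical nodes with an allocation filter before merging it, either manually:

```
# Relocate all shards of this index to nodes tagged box_type=warm;
# "box_type" is an arbitrary attribute name used for illustration.
PUT my-index-000001/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```

or by adding the equivalent allocate action to the warm phase of your ILM policy so it happens automatically.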
It's also worth noting that indexes are fully queryable during the merging process, so you generally shouldn't be too concerned with how long merges take unless the resource usage is interfering with some other part of your system.
Understood. I will give this a shot and get back with the results.
Until then, I guess I will use Curator to perform the forcemerge. Not a good choice, as it turned out.
The job has been running for the past 8 hours, and I still have 4 indices to go. But I guess this had to be done, as there is no workaround for it right now. Hopefully things will run smoothly once I schedule the Curator job daily.
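For the record, the Curator action file I am using looks more or less like this; the index prefix and the age filter are just examples, not my exact values:

```
actions:
  1:
    action: forcemerge
    description: >-
      Force merge matching indices down to one segment per shard.
    options:
      max_num_segments: 1
      delay: 120
      continue_if_exception: false
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 1
```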
That's a relief; I was worried I was getting incomplete results. I do not have any resource usage issues in my environment yet. Hoping it stays that way until I tweak the architecture a bit. Would changing the node attributes result in any kind of downtime?
I am curious why this is consistently a problem in your environment. If you're hitting the issue I linked to, I would expect it to only happen when a force merge is in progress while a shard is relocated, a node is restarted, or something similar happens.
Changing node attributes requires a node restart on the node you're changing the attributes of, but you can do that one node at a time with a rolling restart strategy, so you should be able to do it without any cluster downtime.
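Concretely, the attribute is just a line in each node's elasticsearch.yml, applied one node at a time as you restart them (box_type and warm are arbitrary names used for illustration):

```
# elasticsearch.yml on a node dedicated to merges and older data.
node.attr.box_type: warm
```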
This was right; it was a horrible choice. The forcemerge operations ended up eating all my system resources, and my indexing speed was reduced by 1/3. I eventually cancelled the job, closed all the indices on which the shrink was being attempted, and then indexing was back to normal.
This is exactly when the issue occurs: after the shrink operation, when the cluster goes yellow while the shards are being relocated.
Also, a weird thing I have noticed is that Logstash stops indexing data after this happens. I am running Logstash on Docker, and for some reason enabling the ILM shrink+forcemerge operation causes it to misbehave.
I will dig around a little more and take that up separately. For now, redeploying the Logstash pipeline using a simple monitoring script restores indexing back to normal.
So, I upgraded my cluster to 7.1.1 and also implemented the hot-warm architecture as mentioned here:
The good news is that my indexes are now able to shrink without interrupting Logstash. Also, the forcemerge operations happen effortlessly for the smaller indices. For the larger indices, of the two forcemerge attempts made to date, one was successful while one was still pending as of writing this post (pull request #43246 has been merged).
So to conclude, implementing the hot-warm architecture did help me use ILM more effectively and efficiently, without affecting ingestion.
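In case it helps anyone else, the core of my setup is an index template that pins new indices to the hot-tagged nodes and attaches the ILM policy. Roughly (all names here are placeholders, not my exact configuration):

```
# Legacy template syntax for 7.x; pins new indices to hot nodes and
# attaches the lifecycle policy. Names and the box_type attribute are
# placeholders.
PUT _template/logstash-hot-template
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "index.routing.allocation.require.box_type": "hot",
    "index.lifecycle.name": "high-volume-policy"
  }
}
```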