Elasticsearch ILM stuck at forcemerge

Hi,

I currently have an ILM policy that shrinks my indices and then force merges them. As mentioned here, after the shrink I end up with a shard size of 40GB. I have also configured ILM to delete the indices after 7 days.
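For reference, a policy along these lines would look roughly like the following. This is only a sketch: the policy name, phase timings, and shard counts here are assumptions for illustration, not the actual policy from this thread.

```json
PUT _ilm/policy/my-shrink-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```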

However, the ILM policy for all my indices is stuck at the 'forcemerge' stage.

[screenshot: index ILM status showing the index stuck at the forcemerge step]
Note: `GET indexname/_ilm/explain` gives the same result.
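For context, when an index is parked at the forcemerge action, the explain output looks roughly like this (abbreviated and illustrative; the index and policy names are placeholders, and the real response contains more fields):

```json
GET my-index/_ilm/explain

{
  "indices": {
    "my-index": {
      "managed": true,
      "policy": "my-shrink-policy",
      "phase": "warm",
      "action": "forcemerge",
      "step": "segment-count"
    }
  }
}
```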

It is not moving to the next stage. Is the forcemerge pending due to an issue, or is the segment count already too small for a merge to be needed?

I understand that there is currently no option to check the forcemerge status, and the Elasticsearch logs do not have any information about the issue either. Please suggest.
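There is no dedicated status API, but two existing endpoints give at least partial visibility into whether a force merge is actually running (suggested here as a workaround, not something mentioned in the original post):

```
GET _tasks?actions=*forcemerge*&detailed=true

GET _cat/thread_pool/force_merge?v&h=node_name,active,queue,completed
```

The first lists any in-flight force merge tasks; the second shows activity on the dedicated `force_merge` thread pool per node.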

PS: The monitoring data for the index does show a drop in the segment count, but it does not match the segment count configured in the policy.
[screenshots: monitoring graphs of the index segment count]
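To cross-check the monitoring graphs against what the shards actually contain, the per-shard segment counts can be read directly (the index name here is a placeholder):

```
GET _cat/segments/my-index?v&h=index,shard,segment,size,docs.count
```

If the policy asks for 1 segment per shard, each shard should eventually show a single segment row here.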

Cheers!

You may be running into the first issue described in this GitHub issue where ILM doesn't retry after an interrupted force merge. Can you try re-running the force merge by hand and seeing if that helps?


To answer this question: this status means that ILM has issued a force merge request and is waiting for it to complete. The issue I linked to above describes a situation where ILM can get "stuck" waiting for the force merge to complete if the merge is interrupted for some reason.

Running the forcemerge by hand does not seem to work. Specifying max_num_segments as 1 gives me a timeout error:

```json
{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}
```

There are no logs on the coordinating node that Kibana talks to showing any issues.
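Worth noting: that 504 is Kibana's proxy timing out, not Elasticsearch aborting the merge; the merge keeps running server-side. Issuing the request directly against Elasticsearch avoids the proxy timeout entirely (the index name is a placeholder; note the parameter is spelled `max_num_segments`):

```
POST my-index/_forcemerge?max_num_segments=1
```

The call blocks until the merge finishes, so from a client without its own timeout it will simply appear to hang until completion.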

@gbrown You could be right; it does seem that this is linked to the issue you mentioned: https://github.com/elastic/elasticsearch/issues/42824

After I ran the forcemerge manually yesterday, the index no longer seems to exist on my cluster today. Could it be that, once the forcemerge completed, the index was deleted by ILM?

To confirm, I have again run forcemerge on the next index. If it disappears by tomorrow, then I guess this is what is happening.


I think that's likely. From the Force Merge docs:

This call will block until the merge is complete. If the http connection is lost, the request will continue in the background, and any new requests will block until the previous force merge is complete.

Force merge can take quite a while, so it's possible that something timed out while waiting for the force merge to complete and caused that error to pop up, but it eventually did complete in the background.

Is this happening to every index, or just some of them? I would be surprised if this is happening to every index, and that could indicate something unusual about your setup or an as-yet-unknown bug.

I believe this is what is happening. I currently have two ILM policies, one for my high volume index, and one for my low volume index. The high volume policy creates 6 shards and then shrinks+forcemerges it. The low volume one has 1 shard and only does a forcemerge.

I currently have to wait almost 2-3 hours for the larger index to complete its forcemerge. I haven't reached the forcemerge stage on my low volume indices yet. But it is important to note that I do not have any segregation between my nodes: all my nodes are currently master+data nodes, and I have not assigned any node attributes to classify them as hot-warm-cold.

Would classifying certain nodes as hot, warm and cold improve the forcemerge operation speed? I know it is not recommended to perform a forcemerge on a node while data is being ingested into it. This could probably be the cause of the issue. Thoughts?

Also, this deduction is correct: running forcemerge manually does push ILM to the next stage.

It might? It's difficult to make a concrete recommendation here without much more detailed knowledge of your setup.

Force merge can use a lot of system resources (CPU and IO especially), and allocating the indices to be merged to different nodes than those handling indexing would provide greater separation. It would probably make performance more predictable by saying "these nodes are only going to handle indexing and new data; these other nodes are only going to handle force merges and historical queries" rather than randomly having a mix of the two on any given node. But it may or may not result in significantly faster merges on average; it's difficult to say without knowing lots more details about your cluster, from the hardware specifications of your servers to the indexing/query balance in your workload.

It's also worth noting that indexes are fully queryable during the merging process, so you generally shouldn't be too concerned with how long merges take unless the resource usage is interfering with some other part of your system.
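As a concrete sketch of that separation (the attribute name `box_type` and the index name are arbitrary placeholders): once nodes carry a custom attribute, an index can be moved off the indexing nodes with shard allocation filtering, which is the same mechanism ILM's allocate action uses:

```json
PUT my-index/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```

In an ILM policy, the equivalent is an `allocate` action in the warm phase, so the relocation happens automatically as indices age.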


Understood. I will give this a shot and get back with the results.

Till then, I guess I will use Curator to perform the forcemerge. Probably not a good choice. :smiley:

The job has been running for the past 8 hours and I still have 4 indices to go. But I guess this had to be done, as there is no workaround for it right now. Hopefully things will be smooth once I schedule the Curator job daily.
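For anyone trying the same stopgap, a Curator action file for this looks roughly like the following. This is a sketch, not the actual job from this thread: the index pattern, age filter, and delay values are assumptions.

```yaml
actions:
  1:
    action: forcemerge
    description: >-
      Force merge indices older than one day down to one segment per shard
    options:
      max_num_segments: 1
      delay: 120
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 1
```

The `delay` option pauses between indices, which softens the resource impact a little but does not eliminate it.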

That's a relief. I was worried I was getting incomplete results. I do not have any resource usage issues yet in my environment. Hoping it stays that way till I tweak the architecture a bit. Would changing the node attributes result in any kind of downtime?

I am curious why this is consistently a problem in your environment. If you're hitting the issue I linked to, I would expect that to happen only when a force merge is in progress while a shard is relocated, a node is restarted, or something similar.

Changing node attributes requires a node restart on the node you're changing the attributes of, but you can do that one node at a time with a rolling restart strategy, so you should be able to do it without any cluster downtime.
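For anyone following along, the attribute itself is just a line in elasticsearch.yml on each node, which is why the restart is needed. The attribute name `box_type` below is an arbitrary choice, not anything built in:

```yaml
# elasticsearch.yml on an indexing ("hot") node
node.attr.box_type: hot

# elasticsearch.yml on a merge/query ("warm") node
node.attr.box_type: warm
```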

I opened a PR to fix this today: https://github.com/elastic/elasticsearch/pull/43246


This prediction was right: it was a horrible choice. The forcemerge operations ended up eating all my system resources, and my indexing speed was reduced by 1/3. I eventually cancelled the job. After closing all the indices on which the shrink was being attempted, indexing was back to normal.

I guess I will wait for this (https://github.com/elastic/elasticsearch/pull/43246) to be merged into a GA release, and rework the architecture with separate hot-warm nodes in the meantime. :smiley:

This is exactly when the issue occurs: after the shrink operation, when the cluster goes yellow while the shards are being relocated.

Also, a weird thing I have noticed is that Logstash stops indexing data after this happens. I am running Logstash on Docker, and for some reason enabling the ILM shrink-merge operation causes it to misbehave.

I will dig around a little more and take that up separately. For now, redeploying the Logstash pipeline using a simple monitoring script restores indexing back to normal.

So, I upgraded my cluster to 7.1.1 and also implemented the hot warm architecture as mentioned here:

The good news is that my indexes are now able to shrink without interrupting Logstash. The forcemerge operations also happen effortlessly for the smaller indices. For the larger indices, of the two forcemerge attempts made to date, one was successful while one was still pending as of writing this post (PR #43246 has now been merged).

So to conclude, implementing the hot-warm architecture did help me use ILM more effectively and efficiently, without affecting ingestion.

Thanks @gbrown and @dakrone for the help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.