_forcemerge on ES 7.3 issue

I've upgraded hardware. Clean install of ES 7.3. Importing the same static dataset I was using on my old ES 6 box. Importing on 7 is faster, BUT... after completing the import I'm running a _forcemerge on the ~1.1 TB dataset. On 6 this runs for some time, slowly but surely decreasing the segment count and the disk space used. On 7 there's an issue: the same command on the same dataset has already been running for quite some time, the segment count is slowly INCREASING, and the disk usage of the index has already more than doubled (from 1.1 TB to >2.2 TB).

Any clues about what's going on and why ES / Lucene is using >100% temp storage? Is there a workaround to not use all this temp storage?

This is what it looks like:

Are you still indexing into this index while the force-merge is running, or did you import everything, wait for that to complete, and then request a force-merge? What exactly was the request you made to the force-merge API? Was the cluster health green throughout?

Are you still indexing into this index while the force-merge is running,

No.

or did you import everything, wait for that to complete, and then request a force-merge?

Yes.

What exactly was the request you made to the force-merge API?

curl -H'Content-Type: application/json' -XPOST 'localhost:9200/myindex/_forcemerge?max_num_segments=1&pretty'
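For reference, the effect of a force-merge like this can be watched from the outside with the _cat APIs. A minimal sketch (the host localhost:9200 and index name myindex are assumptions taken from the command above; the curl calls are commented out so the snippet doesn't need a live cluster):

```shell
#!/bin/sh
# Build monitoring URLs for segment count and on-disk size of the index.
# Host and index name are assumptions taken from the thread.
ES="localhost:9200"
INDEX="myindex"

segments_url="http://${ES}/_cat/segments/${INDEX}?v"
stats_url="http://${ES}/_cat/indices/${INDEX}?h=index,segments.count,store.size&v"

# Uncomment to poll a live cluster:
# curl -s "$segments_url"
# curl -s "$stats_url"
```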

Was the cluster health green throughout?

Yes.

Does the disk space reduce again if you flush and refresh on this index?

Does the disk space reduce again if you flush and refresh on this index?

What commands would you recommend to flush and refresh? Thanks.

This should be fine:

POST /myindex/_flush
POST /myindex/_refresh
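From the command line, these two calls would look roughly like this (assuming, as elsewhere in the thread, a node on localhost:9200 and an index called myindex; the curl calls are commented out so the snippet doesn't need a live cluster):

```shell
#!/bin/sh
# curl equivalents of the flush and refresh requests above.
# Host and index name are assumptions taken from the thread.
ES="localhost:9200"
INDEX="myindex"

flush_url="http://${ES}/${INDEX}/_flush"
refresh_url="http://${ES}/${INDEX}/_refresh"

# Uncomment against a live cluster:
# curl -s -XPOST "$flush_url"
# curl -s -XPOST "$refresh_url"
```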

Update: the merge process continued until 100% of disk space was taken. At that point the merge took up more than 2x the index size(!). The _forcemerge nevertheless seems to have been successful ("failed" : 0). I ran _flush and _refresh and waited a few minutes. Result: the number of segments went DOWN to a value lower than requested (I used a rather high value instead of 1 to see if it would work anyway), and disk space is back to normal. Hmmm, that's pretty strange... Perhaps the process had already finished earlier but kept on writing data for some magic reason? Is there a way I can provide you with debug information?

Something that caught my eye in the storage directory: I see large files with names containing both Lucene50 and Lucene80. To an amateur like me that looks like v5 and v8 files, where I would expect v8 only. Is this correct? The system contains a clean ES 7.3 install and a clean generation of the index (no upgrades / re-indexing from older versions). Other info: using best_compression and large ngram tokenizers.

Thanks for the update. It is possible that this is related to https://github.com/elastic/elasticsearch/pull/46066 (Flush engine after big merge), in which Elasticsearch isn't always as enthusiastic about flushing as perhaps it should be. Did you try flushing while the force-merge was ongoing too, or only at the end? Can you wait for the release of 7.4.0 and then try again?

I will check, but I don't think this is something to worry about.

It is possible that this is related to https://github.com/elastic/elasticsearch/pull/46066 in which Elasticsearch isn't always as enthusiastic about flushing as perhaps it should be.

Sounds plausible, thanks for the link!

Did you try flushing while the force-merge was ongoing too, or only at the end?

Only at the end. Will try next time when ongoing.

Can you wait for the release of 7.4.0 and then try again?

Yes, it's a non-production box used to make sure that there are no similar issues in production after a 'quick' upgrade :wink:

What would you suggest for now, e.g. screening the following during the merge?

while true
do
  sleep 600
  curl -H'Content-Type: application/json' -XPOST 'localhost:9200/myindex/_flush'
done

Or something more sophisticated?
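One possible "more sophisticated" variant, purely as an untested sketch: only keep flushing while a force-merge task is actually listed in the task management API (action name indices:admin/forcemerge), and log the segment count and store size after each flush. Host and index name are the same assumptions as in the loop above:

```shell
#!/bin/sh
# Flush periodically while a force-merge task is still running, and log
# the segment count and on-disk size of the index after each flush.
ES="localhost:9200"
INDEX="myindex"

merge_watch() {
  # Loop for as long as the task management API still lists a force-merge.
  while curl -s "http://${ES}/_tasks?actions=*forcemerge*" | grep -q forcemerge
  do
    curl -s -XPOST "http://${ES}/${INDEX}/_flush" > /dev/null
    curl -s "http://${ES}/_cat/indices/${INDEX}?h=segments.count,store.size"
    sleep 600
  done
}

# merge_watch   # invoke this while the force-merge is in progress
```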

Confirmed, this is fine.

If that works then it seems like a reasonable workaround. We'd like to know if it does work, because if it doesn't then you might have hit a different (and as-yet-unknown) issue instead.

If that works then it seems like a reasonable workaround. We'd like to know if it does work, because if it doesn't then you might have hit a different (and as-yet-unknown) issue instead.

I've started a new run, using a smaller dataset, for faster results. Similar behavior: the segment count slowly grows (difficult to see on the screenshot because of the huge drop, but still the case) and disk usage increased significantly over the hours. Then I applied your suggested workaround (flush and refresh), and...

Looking good :love_you_gesture:

So it seems to be the same or a similar issue as described at https://github.com/elastic/elasticsearch/pull/46066.

Conclusion: the suggested workaround is effective in 7.3, with a structural fix as soon as 7.4 is released, right?

Great, thanks for reporting back. I also expect 7.4 contains a fix for this, yes, but please let us know if the problem isn't fixed there and we'll investigate further.
