_forcemerge on ES 7.3 issue

I've upgraded hardware. Clean install of ES 7.3. Importing the same static dataset I was using on my old ES 6 box. Importing on 7 is faster, BUT... after completing the import I'm running a _forcemerge on the ~1.1 TB dataset. On 6 this runs for some time, slowly but surely decreasing the segment count and the disk space used. On 7 there's an issue: the same command on the same dataset has already been running for quite some time, the segment count is slowly INCREASING, and the disk usage of the index has already more than doubled (from 1.1 TB to >2.2 TB).

Any clues about what's going on and why ES / Lucene is using >100% temp storage? Is there a workaround to not use all this temp storage?

This is what it looks like:

Are you still indexing into this index while the force-merge is running, or did you import everything, wait for that to complete, and then request a force-merge? What exactly was the request you made to the force-merge API? Was the cluster health green throughout?

Are you still indexing into this index while the force-merge is running,

No.

or did you import everything, wait for that to complete, and then request a force-merge?

Yes.

What exactly was the request you made to the force-merge API?

curl -H'Content-Type: application/json' -XPOST 'localhost:9200/myindex/_forcemerge?max_num_segments=1&pretty'
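For reference, the effect of a force-merge like this can be watched from the outside with the _cat APIs. A minimal sketch (the host localhost:9200 and index name myindex are assumptions taken from the command above; the curl calls are commented out so the snippet doesn't need a live cluster):

```shell
#!/bin/sh
# Build monitoring URLs for segment count and on-disk size of the index.
# Host and index name are assumptions taken from the thread.
ES="localhost:9200"
INDEX="myindex"

segments_url="http://${ES}/_cat/segments/${INDEX}?v"
stats_url="http://${ES}/_cat/indices/${INDEX}?h=index,segments.count,store.size&v"

# Uncomment to poll a live cluster:
# curl -s "$segments_url"
# curl -s "$stats_url"
```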

Was the cluster health green throughout?

Yes.

Does the disk space reduce again if you flush and refresh on this index?

Does the disk space reduce again if you flush and refresh on this index?

What commands would you recommend to flush and refresh? Thanks.

This should be fine:

POST /myindex/_flush
POST /myindex/_refresh
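From the command line, these two calls would look roughly like this (assuming, as elsewhere in the thread, a node on localhost:9200 and an index called myindex; the curl calls are commented out so the snippet doesn't need a live cluster):

```shell
#!/bin/sh
# curl equivalents of the flush and refresh requests above.
# Host and index name are assumptions taken from the thread.
ES="localhost:9200"
INDEX="myindex"

flush_url="http://${ES}/${INDEX}/_flush"
refresh_url="http://${ES}/${INDEX}/_refresh"

# Uncomment against a live cluster:
# curl -s -XPOST "$flush_url"
# curl -s -XPOST "$refresh_url"
```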

Update: the merge process continued until 100% of disk space was taken. At that point the merge took up more than 2x the index size(!). The _forcemerge nevertheless seems to have been successful ("failed" : 0). I ran _flush and _refresh and waited a few minutes. Result: the number of segments went DOWN to a value lower than requested (I used a rather high value instead of 1 to see if it would work anyway), and disk space is back to normal. Hmmm, that's pretty strange... Perhaps the process had already finished earlier but kept on writing data for some magic reason? Is there a way I can provide you with debug information?

Something that caught my eye in the storage directory: I see large files with names containing both Lucene50 and Lucene80. To an amateur like me that looks like v5 and v8 files, where I would expect v8 only. Is this correct? The system contains a clean ES 7.3 install and a clean generation of the index (no upgrades / re-indexing from older versions). Other info: using best_compression and large ngram tokenizers.

Thanks for the update. It is possible that this is related to https://github.com/elastic/elasticsearch/pull/46066 (Flush engine after big merge), in which Elasticsearch isn't always as enthusiastic about flushing as perhaps it should be. Did you try flushing while the force-merge was ongoing too, or only at the end? Can you wait for the release of 7.4.0 and then try again?

I will check, but I don't think this is something to worry about.

It is possible that this is related to https://github.com/elastic/elasticsearch/pull/46066 in which Elasticsearch isn't always as enthusiastic about flushing as perhaps it should be.

Sounds plausible, thanks for the link!

Did you try flushing while the force-merge was ongoing too, or only at the end?

Only at the end. Will try next time when ongoing.

Can you wait for the release of 7.4.0 and then try again?

Yes, it's a non-production box used to make sure that there are no similar issues in production after a 'quick' upgrade :wink:

What would you suggest for now, e.g. screening the following during the merge?

while true
do
  sleep 600
  curl -H'Content-Type: application/json' -XPOST 'localhost:9200/myindex/_flush'
done

Or something more sophisticated?
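One possible "more sophisticated" variant, purely as an untested sketch: only keep flushing while a force-merge task is actually listed in the task management API (action name indices:admin/forcemerge), and log the segment count and store size after each flush. Host and index name are the same assumptions as in the loop above:

```shell
#!/bin/sh
# Flush periodically while a force-merge task is still running, and log
# the segment count and on-disk size of the index after each flush.
ES="localhost:9200"
INDEX="myindex"

merge_watch() {
  # Loop for as long as the task management API still lists a force-merge.
  while curl -s "http://${ES}/_tasks?actions=*forcemerge*" | grep -q forcemerge
  do
    curl -s -XPOST "http://${ES}/${INDEX}/_flush" > /dev/null
    curl -s "http://${ES}/_cat/indices/${INDEX}?h=segments.count,store.size"
    sleep 600
  done
}

# merge_watch   # invoke this while the force-merge is in progress
```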

Confirmed, this is fine.

If that works then it seems like a reasonable workaround. We'd like to know if it does work, because if it doesn't then you might have hit a different (and as-yet-unknown) issue instead.

If that works then it seems like a reasonable workaround. We'd like to know if it does work, because if it doesn't then you might have hit a different (and as-yet-unknown) issue instead.

I've started a new run, using a smaller dataset, for faster results. Similar behavior: the segment count slowly grows (difficult to see on the screenshot because of the huge drop, but still the case) and disk usage increased significantly over the hours. Then I applied your suggested workaround (flush and refresh), and...

Looking good :love_you_gesture:

So it seems to be the same or a similar issue as described at https://github.com/elastic/elasticsearch/pull/46066.

Conclusion: the suggested workaround is effective in 7.3, with a structural fix as soon as 7.4 is released, right?

Great, thanks for reporting back. I also expect 7.4 contains a fix for this, yes, but please let us know if the problem isn't fixed there and we'll investigate further.
