After migrating from Elasticsearch 6.3 to 7.17, the index size on disk doubled

Hello everyone,

I've migrated an Elasticsearch 6.3 cluster to version 7.17 (by creating a new cluster with the same index mapping/shard structure and reindexing), and the index size on disk almost doubled. The cluster is relatively small in terms of data volume (~1M documents) but has a high indexing rate (~105k requests/s).

I checked `_cat/segments`: there is a slightly higher number of segments in ES7, but more critically, segment sizes increased by a factor of 1.63 (average) and 1.97 (maximum). The average number of deleted documents per segment also more than tripled. The ES6 Lucene segment version is 7.3.1, and the ES7.17 segment version is 8.11.1.
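
For reference, this is roughly how I aggregated the per-segment stats above. A minimal sketch, assuming a hypothetical cluster at localhost:9200 and an index named "my-index":

```python
# Minimal sketch (hypothetical host/index): aggregate per-segment
# stats from the cat segments API.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/segments/my-index",
    params={"format": "json", "bytes": "b"},
)
segments = resp.json()

sizes = [int(s["size"]) for s in segments]
live = sum(int(s["docs.count"]) for s in segments)
deleted = sum(int(s["docs.deleted"]) for s in segments)

print(f"segments:      {len(segments)}")
print(f"avg size:      {sum(sizes) / len(sizes) / 1024**2:.1f} MiB")
print(f"max size:      {max(sizes) / 1024**2:.1f} MiB")
print(f"deleted ratio: {deleted / (live + deleted):.1%}")
```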

  1. Is that increase in disk space expected?
  2. If so, how can I improve it? (Based on the 3x increase in deleted documents, I suspect that the segment merge process in ES7.17 reclaims deletes more slowly than my old ES6 cluster did; see the force-merge workaround sketched after this list.)
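
The workaround mentioned in point 2 is the only mitigation I've come across so far: an explicit force merge that only expunges deleted docs. A sketch, again assuming a hypothetical cluster at localhost:9200 and an index named "my-index":

```python
# Sketch (hypothetical host/index): ask ES to merge away segments
# dominated by deleted docs, without a full force merge.
import requests

requests.post(
    "http://localhost:9200/my-index/_forcemerge",
    params={"only_expunge_deletes": "true"},
)
```

That said, as far as I understand, force-merging an index that is still under heavy ingest is generally discouraged, so I'd prefer a merge-policy-level fix.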

Thank you in advance!

Welcome to our community! :smiley:

There have been changes to merges to allow them to be more intelligently managed and not overwhelm the nodes, which used to happen quite a lot. This can show up like what you're seeing, which means you might want to increase the hardware resources available, especially the disk.
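
If you want to see whether merge throttling is kicking in, the merge section of the index stats will show it. A quick sketch, assuming a hypothetical cluster at localhost:9200 and an index named "my-index":

```python
# Sketch (hypothetical host/index): read merge counters, including
# throttled time, from the index stats API.
import requests

stats = requests.get("http://localhost:9200/my-index/_stats/merge").json()
merges = stats["_all"]["primaries"]["merges"]

print("total merges:         ", merges["total"])
print("total merge time (ms):", merges["total_time_in_millis"])
print("throttled time (ms):  ", merges["total_throttled_time_in_millis"])
```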

Hi Mark,

Thank you for the welcome and for the quick reply. I'm glad to join the Elastic community!

> There have been changes to merges to allow them to be more intelligently managed and not overwhelm the nodes, which used to happen quite a lot.

Are you referring to the "Reclaiming deletes through merges" item that showed up in your blog post This Week in Elasticsearch and Apache Lucene - 2018-07-07?

If so, I managed to trace this to PR #32907. The default value is 33%, and based on my analysis of the segments API output, the average proportion of deleted to live docs is about 30%. At the same time, this proportion for ES6 is only about 7%. From the Lucene code I see that I can't set it lower than 20%, which is also enforced at the ES settings level. Based on the percentages I see for both clusters, it seems that the previous weight-based approach (reclaim_deletes_weight, default 2.0) was more aggressive than the new one. At the same time, the description of the proposed change in LUCENE-8263 says something very different ("The current TMP allows up to 50% deleted docs, which can be wasteful on large indexes."). I'm somewhat confused. Are there any other settings that influence how aggressively deleted docs are reclaimed?
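
In case it helps anyone else: the setting from that PR (index.merge.policy.deletes_pct_allowed) is dynamic, so pinning it to the 20% floor looks roughly like this. A sketch, assuming a hypothetical cluster at localhost:9200 and an index named "my-index":

```python
# Sketch (hypothetical host/index): set deletes_pct_allowed to 20,
# the lowest value ES 7.17 accepts, to reclaim deletes more eagerly.
import requests

requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index": {"merge": {"policy": {"deletes_pct_allowed": 20}}}},
)
```
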
Interestingly, in the recently released Lucene 9.5 they decreased the lower bound to 5% and the default to 20% (see PR #11831). Do you know any background on this change?

Otherwise, could you please share links where I can read more about the changes you are referring to?

> This can show up like what you're seeing, which means you might want to increase the hardware resources available, especially the disk.

Unfortunately, this must be something different; I have plenty of spare capacity on the data nodes: CPU utilization is at most 15% (I over-provisioned the capacity greatly to eliminate this cause), and the data nodes have local NVMe SSDs consuming an average of just 465 write IOPS (max 638 over the last week) per node. Disk read/write throughput is also pretty low (average write 15 MB/s, and reads are literally a few KB/s since the whole index is memory-mapped).

Thank you in advance!
