Cluster upgrade from 5.4 -> 5.6 doubled disk usage

I was asked to start with a discussion topic here rather than a GitHub issue.

Original link: https://github.com/elastic/elasticsearch/issues/50323

The problem, as described in the issue, is that after following the standard cluster upgrade guide to go from 5.4 to 5.6, disk usage on each node doubled. That disk usage has not dropped even after the entire cluster finished the upgrade.

Some of the indices were originally created under version 5.3, some under version 5.4.

The question is: how can I force Elasticsearch to clean up this disk usage? I assume there wasn't some atrocious change in the underlying data format that ballooned disk usage in 5.6.

I also don't know very much about the ES data structure, so I don't know how to tell which shards are "leftovers".

Are you also the poster of this similar-looking (but more detailed) question on Stack Overflow?

As mentioned there, I recommend using the indices stats API to work out the details of what's going on.
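For example, something like this (a rough sketch in Python with `requests`, assuming the cluster answers on localhost:9200 without security enabled; adjust the host to your setup) lists the store and translog size of every index, largest store first:

```python
# Sketch: list store and translog size per index, sorted by store size.
# Assumes a locally reachable cluster on port 9200 with no authentication.
import requests

resp = requests.get("http://localhost:9200/_stats/store,translog")
resp.raise_for_status()
indices = resp.json()["indices"]

rows = []
for name, data in indices.items():
    total = data["total"]  # primaries + replicas for this index
    rows.append((name,
                 total["store"]["size_in_bytes"],
                 total["translog"]["size_in_bytes"]))

for name, store, translog in sorted(rows, key=lambda r: r[1], reverse=True):
    print("%-40s store=%14d  translog=%12d" % (name, store, translog))
```

That should at least show which indices are the big consumers and whether any of them look out of line.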

Yes, that's correct.

I appreciate the advice not to manually delete the generated directories.

If you could clarify exactly what I'm looking for in the index stats, that would help. I have a lot of indices, there are a lot of stats, and I don't really know what I should be looking for.

And in case this needs more clarification: the cluster is green, disk usage doubled after the upgrade and has not gone down, and it has been days since the cluster was fully upgraded and green.

The two main components of each shard's disk usage are the store and the translog, whose sizes are reported in the indices stats. I would check that these numbers seem reasonable to you and, if possible, compare them to any stats you might have from before the upgrade to help pinpoint exactly what has got larger. I would expect the translog of indices that haven't recently seen write activity to be pretty tiny, and that the total of all this disk usage corresponds closely with the amount of space actually consumed on disk.
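To check that last point, something along these lines (same assumptions as before: localhost:9200, no auth) pulls the cluster-wide totals from the same stats call, which you can then compare against `du` on each node's data path:

```python
# Sketch: cluster-wide store and translog totals, to compare with what the
# disks actually report as used. Assumes localhost:9200, no authentication.
import requests

stats = requests.get("http://localhost:9200/_stats/store,translog").json()
total = stats["_all"]["total"]

gib = 1024 ** 3
print("store total:    %.1f GiB" % (total["store"]["size_in_bytes"] / gib))
print("translog total: %.1f GiB" % (total["translog"]["size_in_bytes"] / gib))
# If this sum is roughly half of what the disks show as used, the extra space
# is not being attributed to the shards Elasticsearch knows about; if it
# matches the disk usage, then the shard data itself really has grown.
```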

Sizes vary across indices from a few MB to hundreds of GB, so unfortunately I don't really have a benchmark for what a "reasonable" store size would be.

Translog values are low.

I guess maybe there's an underlying miscommunication about this: I'm not really trying to prove to myself that disk usage doubled after the upgrade. I know that's what happened; I saw it happen on eight nodes. Assuming this is abnormal behavior, how do I fix it? How do I get Elasticsearch to clean up the old data? If this is not abnormal behavior, that means Elasticsearch ballooned the index size to double using the 5.6 data format, which seems really unlikely.

Another way to put it: I can spend time searching through index stats for hundreds of indices and looking at their store and translog values, but why? What's the endgame here?

Yes, this behaviour sounds abnormal to me. Unfortunately without knowing any detail about what it is that's consuming the extra disk space it isn't really possible to offer any advice about what action you should take. Total disk usage is a very coarse measure. Is each shard twice as large as before or are they all the same size? Maybe they're the same size but there's twice as many of them? Maybe there's something that only affects a small subset of the shards in a very severe way? Maybe Elasticsearch thinks it's using the same disk usage as before and the extra disk usage is something else entirely?
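A couple of quick checks along those lines, again as a sketch under the same assumptions (localhost:9200, no auth), using the standard `_cat/shards` and `_cat/allocation` endpoints:

```python
# Sketch: (1) per-shard sizes and counts, (2) per-node comparison of the disk
# usage Elasticsearch attributes to its indices vs. total used disk.
# Assumes localhost:9200, no authentication; adjust to your setup.
import requests

BASE = "http://localhost:9200"

# 1. Has the number of shards doubled, the size of each shard, or only a
#    subset of them? Print the 20 largest shards and the total shard count.
shards = requests.get(
    BASE + "/_cat/shards",
    params={"format": "json", "bytes": "b",
            "h": "index,shard,prirep,store,node"},
).json()
print("total shards:", len(shards))
for s in sorted(shards, key=lambda s: int(s["store"] or 0), reverse=True)[:20]:
    print(s["index"], s["shard"], s["prirep"], s["store"], s["node"])

# 2. Does the space Elasticsearch attributes to indices (disk.indices)
#    account for the overall used space (disk.used) on each node, or is the
#    growth outside Elasticsearch's own accounting?
alloc = requests.get(
    BASE + "/_cat/allocation",
    params={"format": "json", "bytes": "b",
            "h": "node,shards,disk.indices,disk.used,disk.avail"},
).json()
for row in alloc:
    print(row["node"], row["shards"], row["disk.indices"],
          row["disk.used"], row["disk.avail"])
```

The answers to those questions would narrow down where the extra space actually lives and therefore what, if anything, is safe to clean up.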
