Cloned/Split Indexes Take Double Disk Space When Increasing Shards

Are you asking about the total disk consumption as reported by the OS (e.g. using df), or just the size reported for the cloned/split index (e.g. using GET _cat/indices)? The latter double-counts the actual disk space used: the clone's segment files are hard links to the original's, so the same bytes on disk are reported once under each index.
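To make the hard-link effect concrete, here is a minimal Python sketch. It is only a filesystem analogy, not Elasticsearch's own code, and the directory and file names (original, clone, segment.dat) are made up for illustration. It writes one file, hard-links it into a second directory, and compares the per-directory size with the size counted once per inode.

```python
import os
import tempfile


def apparent_size(directory):
    """Sum the sizes of all files in a directory, one count per directory entry."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


def unique_size(*directories):
    """Sum file sizes across directories, counting each underlying inode once."""
    seen = set()
    total = 0
    for directory in directories:
        for name in os.listdir(directory):
            st = os.stat(os.path.join(directory, name))
            key = (st.st_dev, st.st_ino)  # identifies the file data on disk
            if key not in seen:
                seen.add(key)
                total += st.st_size
    return total


def demo(segment_bytes=4096):
    """Return (original size, clone size, size actually occupied on disk)."""
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "original")  # hypothetical "index" directory
        clone = os.path.join(root, "clone")        # hypothetical clone directory
        os.mkdir(original)
        os.mkdir(clone)

        segment = os.path.join(original, "segment.dat")
        with open(segment, "wb") as f:
            f.write(b"\0" * segment_bytes)

        # The clone gets a hard link, not a copy: no extra data is written.
        os.link(segment, os.path.join(clone, "segment.dat"))

        return apparent_size(original), apparent_size(clone), unique_size(original, clone)


if __name__ == "__main__":
    orig, cloned, on_disk = demo()
    print(f"original={orig} clone={cloned} sum={orig + cloned} on_disk={on_disk}")
```

Adding up the two per-directory figures gives twice the space the data really occupies, which is the same kind of double count you get from summing per-index sizes for an index and its clone.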

GET _cat/indices should report the size of a clone to be identical to the size of the original index.

Splitting the index works by cloning all the shards (multiple times) and then effectively running a delete-by-query on each copy to remove the docs that don't belong in it, which certainly increases the reported size until merging cleans up the deleted docs. If you're still writing to this index then those merges will happen in time; if you're not still writing to it then you can try force-merging to make them happen sooner. There's also some per-shard disk space overhead -- particularly the terms dictionary, which tends to be large and not to get much smaller after a split since most shards contain roughly the same set of terms.
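If you want to check how many deleted docs a split index is still carrying and then nudge the merge along, something like the following Python sketch could work. It uses only the standard library; the cluster address (http://localhost:9200, no authentication) and the index name my-split-index are assumptions for illustration, so adjust them for your setup. It builds a GET _cat/indices request that shows docs.deleted next to store.size, and a POST _forcemerge request with only_expunge_deletes=true.

```python
import urllib.request
from urllib.parse import quote, urlencode

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def cat_indices_url(index, base=ES_URL):
    """URL for GET _cat/indices showing doc counts, deleted docs and store size."""
    params = urlencode(
        {"v": "true", "h": "index,pri,docs.count,docs.deleted,store.size"},
        safe=",",  # keep the column list readable
    )
    return f"{base}/_cat/indices/{quote(index)}?{params}"


def force_merge_url(index, base=ES_URL):
    """URL for POST _forcemerge that only rewrites segments holding deleted docs."""
    params = urlencode({"only_expunge_deletes": "true"})
    return f"{base}/{quote(index)}/_forcemerge?{params}"


def send(url, method="GET"):
    """Send the request and return the response body as text."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    index = "my-split-index"  # hypothetical index name
    print(send(cat_indices_url(index)))            # size and docs.deleted before
    print(send(force_merge_url(index), "POST"))    # may take a while on big shards
    print(send(cat_indices_url(index)))            # size and docs.deleted after
```

Comparing docs.deleted and store.size before and after should show how much of the reported size was down to not-yet-merged deletes rather than the per-shard overhead. Force-merging is usually best kept to indices you've stopped writing to.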
