We seeded an index ~6 months ago. Back then the total space occupied was roughly 7TB of data.
Over the last 6 months we have been updating the index to keep it synced with the source data.
Over this period the total index size has grown to 11TB.
Now we needed to reseed the data into a new index (because we wanted to support some more fields). The expectation was that it would again come to ~11TB.
It turns out that the total disk space of this new index is only 8.5TB. The total number of documents seeded matches.
I can't explain where this large difference of 8.5TB versus the expected 11TB comes from.
Since you update the index, I would expect that this extra space is occupied by deleted documents.
Deleted documents are only removed when a segment merge happens, but Elasticsearch has some requirements before it will merge a segment, like the shard size and the percentage of deleted documents. If you have a lot of shards, it is possible that those requirements were not reached yet and the segments didn't merge, so the deleted documents would still occupy some space.
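You can check how much of the old index is still taken up by deleted documents with the cat APIs. Something along these lines (the index name is just a placeholder) shows live docs, deleted docs, and store size side by side:

```
# "my-old-index" is a placeholder; substitute your own index name
GET _cat/indices/my-old-index?v&h=index,docs.count,docs.deleted,store.size,pri.store.size
```

If docs.deleted is a sizeable fraction of docs.count, that would account for most of the 11TB vs 8.5TB gap.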
When you indexed the data into a new index, the deleted documents were not indexed, so it requires less space.
Yeah, I think this explains the difference in size: the deleted documents weren't purged yet.
Elasticsearch automatically merges segments and purges the deleted documents. I'm not sure exactly how this is done, but from what I know it takes into consideration the size of shards and segments and the percentage of deleted documents as triggers.
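If you want to look at the actual merge settings, or expunge the deleted documents without a full reindex, something like this should work (the index name is a placeholder, and a force merge is I/O-heavy, so run it off-peak):

```
# Show the merge policy settings (including defaults) that control when merges trigger
GET my-old-index/_settings?include_defaults=true&filter_path=*.defaults.index.merge*

# Merge only segments that contain enough deletions to be worth rewriting
POST my-old-index/_forcemerge?only_expunge_deletes=true
```

The first request shows the merge policy values in effect; the second asks Elasticsearch to rewrite only the segments with a high enough share of deleted documents, which reclaims that space without reseeding.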