I ran 'optimize' on old Logstash indices with 'max_num_segments=1' to improve search performance and maybe reclaim some disk space. However, I noticed that 'optimize' actually caused most indices to use more disk space: the cluster size went from 4TB to 4.4TB!
The cluster is running ES 1.7.2 on 5 nodes with 5 shards and 0 replicas. Is this expected behavior?
High-cardinality fields with doc values can be a reason, like Mark said. For instance, if all your values are unique and you have two segments with 1M unique values each, then the merged segment will have 2M unique values, which requires one more bit per document for addressing. There isn't really anything that can be done about it; this is just the way things are designed.
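To make the "one more bit per document" point concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which each document stores a single fixed-width ordinal and nothing else; real Lucene doc values use more elaborate encodings, and the function names are made up for illustration.

```python
def bits_per_ordinal(unique_values: int) -> int:
    """Smallest number of bits that can address `unique_values` distinct ordinals."""
    return max(1, (unique_values - 1).bit_length())


def ordinal_bytes(docs: int, unique_values: int) -> float:
    """Bytes needed to store one fixed-width ordinal per document (simplified model)."""
    return docs * bits_per_ordinal(unique_values) / 8


# Two separate segments, each with 1M documents and 1M unique values.
before = 2 * ordinal_bytes(1_000_000, 1_000_000)

# One merged segment with 2M documents and 2M unique values.
after = ordinal_bytes(2_000_000, 2_000_000)

print("bits per doc before merge:", bits_per_ordinal(1_000_000))
print("bits per doc after merge: ", bits_per_ordinal(2_000_000))
print("ordinal bytes before:", before, "after:", after)
```

In this toy model the ordinals go from 20 to 21 bits per document, so the same 2M documents take about 5% more space after the merge.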
Another potential reason is sparse fields with doc values. For efficiency reasons, Elasticsearch needs to reserve space for documents that don't have a value. Imagine you have 2 segments: segment 1 has values for field 'foo' but segment 2 does not. Field 'foo' does not require any disk space on segment 2, but as soon as you merge those segments, Elasticsearch will suddenly need to reserve some space for all documents of segment 2 even though they don't have a value for 'foo'. This is something that we hope to improve soon in the extreme cases (when less than 1% of documents have a value for a given field). You can see https://issues.apache.org/jira/browse/LUCENE-6863 for more information.
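The sparse-field effect can be illustrated with a toy model as well. The sketch below assumes that a segment reserves one fixed-size slot per document for a field as soon as any document in that segment has a value, and nothing otherwise. The 8-byte slot size and the function name are hypothetical, chosen only to make the arithmetic visible; they are not taken from Lucene.

```python
def reserved_bytes(total_docs: int, docs_with_value: int, bytes_per_slot: int = 8) -> int:
    """Toy model: a field absent from a segment costs nothing there;
    otherwise one slot is reserved for every document in the segment."""
    if docs_with_value == 0:
        return 0
    return total_docs * bytes_per_slot


# Segment 1: every document has a value for 'foo'.
seg1 = reserved_bytes(total_docs=1_000_000, docs_with_value=1_000_000)

# Segment 2: no document has a value for 'foo'.
seg2 = reserved_bytes(total_docs=1_000_000, docs_with_value=0)

# Merged segment: half of the documents have a value, but all get a slot.
merged = reserved_bytes(total_docs=2_000_000, docs_with_value=1_000_000)

print("before merge:", seg1 + seg2, "bytes")
print("after merge: ", merged, "bytes")
```

Under these assumptions the field doubles in size after the merge even though no new values were added.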
How are you measuring the index size? Are you running ES on Windows (which refuses to delete still-open files)?
Be sure to flush after optimizing, otherwise the old segments may still be referenced (by either the last commit point, or the last refreshed reader, or both) and consuming disk space even though they are effectively "ghosts".
@mikemccand ES is running on CentOS. I'm calling 'stats' and using 'size_in_bytes' to get the index size. I didn't flush explicitly because, as per the documentation, 'flush' defaults to true when running optimize.
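For anyone who wants to repeat the measurement, here is a minimal sketch of flushing an index and then reading 'size_in_bytes' from the stats API. It assumes a node reachable at http://localhost:9200 and the ES 1.x endpoints '_flush' and '_stats/store'; the host, the helper names and the example index name are placeholders, not something from this thread.

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed address of one node in the cluster


def flush_url(index: str) -> str:
    return f"{ES}/{index}/_flush"


def stats_url(index: str) -> str:
    return f"{ES}/{index}/_stats/store"


def primary_store_bytes(stats: dict, index: str) -> int:
    """Pull the primary-shard store size out of a stats response."""
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]


def flushed_index_size(index: str) -> int:
    """Flush first so old segments are no longer referenced, then read the size."""
    request = urllib.request.Request(flush_url(index), data=b"", method="POST")
    urllib.request.urlopen(request).close()
    with urllib.request.urlopen(stats_url(index)) as response:
        return primary_store_bytes(json.load(response), index)
```

Calling flushed_index_size("logstash-2015.10.01") (a hypothetical index name) before and after the optimize gives two numbers that can be compared directly. With 0 replicas, the primaries size equals the total size.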