Hi All,
I am using ES 1.7.3. My cluster handles a large number of updates daily. Ever since we switched from ES 1.3.9 to ES 1.7.3, we have been seeing about 150% more disk usage as these updates happen. I looked at the indices and segments, and it turned out that the additional disk usage comes from deleted (i.e. updated) documents. If I run _optimize?only_expunge_deletes=true I am able to reclaim the space, but ES does not seem to reclaim it when merges happen on their own.
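For reference, these are roughly the calls I use to check the segments and to expunge deletes (my_index stands in for the real index name):

# list segments, including deleted-doc counts per segment
curl -XGET 'localhost:9200/_cat/segments?v'
# force-expunge deleted documents; this does reclaim the space
curl -XPOST 'localhost:9200/my_index/_optimize?only_expunge_deletes=true'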
With ES 1.3.9 this worked perfectly fine: disk usage was never a problem despite the same large volume of updates. With 1.7.3 I see disk usage creep up alarmingly.
I do have multi-gigabyte shards due to the large data volume. The cluster also serves a heavy query rate (about 4-5 thousand queries per second). I use SSD-backed machines and haven't seen disk I/O or memory bottlenecks; it's only the disk usage that is a problem. How do I tune my merge policy to keep up with the updates? I am OK with indexing being a little slower because of additional merges, but I cannot keep up with this kind of disk growth. Please advise!
Earlier I had all the default settings for ES 1.7, except:
"index.store.throttle.type": "none"
When this index was created, max_merged_segment was left at its default (5gb). When I saw deletes accumulating, I set it to 1gb and tried to tune settings to recover space, but this did not work; the exact call is below. The percentage of deleted documents is spread fairly evenly across segments, at about 12%.
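This is the settings update I used to lower max_merged_segment:

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.merge.policy.max_merged_segment": "1gb"
}'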
Considering this, I have a couple of questions:
What percentage of deleted docs does a segment need before it is considered for merging?
If max_merged_segment had been 1gb from the start, before the data was loaded, would that have helped with the growing percentage of deleted docs? Would reindexing help?