Currently I have some large indices (100 GB - 500 GB) that have been allocated 5 shards each. I would like to increase these to 10-20 shards each.
If I use the Split API or Clone API, the new index is created in 1-2 minutes, but the disk space doubles if I set it to 10 shards and quadruples if I set it to 20 shards (replicas are still 0). Is there a way to reduce the disk space? The document counts are the same; I'm guessing it has to do with how the index links to the physical files at the OS level. (If I mess around by opening/closing the index etc., sometimes it will shrink closer to the initial value.)
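For reference, this is roughly the split request I'm running (index names and shard count are placeholders):

```
POST /my-index/_split/my-index-split
{
  "settings": {
    "index.number_of_shards": 10
  }
}
```

(The source index has `index.blocks.write: true` set beforehand, as the Split API requires.)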
If I use the Reindex API it uses about the same disk space, but it takes a very long time (5-7 minutes per GB). Is there a way to speed this up?
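For completeness, I'm just using the basic form of the reindex call (index names are placeholders):

```
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-reindexed" }
}
```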
Are you asking about the total disk consumption as reported by the OS (e.g. using df) or do you mean just for the cloned/split index (e.g. using GET _cat/indices)? The latter double-counts the actual disk space used because of the use of hard links.
GET _cat/indices should report the size of a clone to be identical to the size of the original index.
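To compare the two directly, you can ask `_cat/indices` for just the size columns (index names here are examples):

```
GET _cat/indices/my-index,my-index-clone?v&h=index,docs.count,store.size,pri.store.size
```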
Splitting the index works by cloning all the shards (multiple times) and then effectively running a delete-by-query on them, which certainly increases the reported size until merging cleans up the deleted docs. If you're still writing to this index then that'll happen in time; if you're not still writing to this index then you can try force-merging to make it happen sooner. There's also some per-shard disk space overhead -- particularly the terms dictionary tends to be large and not to get much smaller after a split since most shards contain roughly the same set of terms.
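If you want to trigger that merging manually on the split index, a force-merge that only expunges deleted docs looks like this (index name is a placeholder, and this assumes you are no longer writing to the index):

```
POST /my-split-index/_forcemerge?only_expunge_deletes=true
```

or, to merge each shard down to a single segment:

```
POST /my-split-index/_forcemerge?max_num_segments=1
```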
I'm using the size reported in Kibana or via GET _cat/indices. (I currently do not have access to the underlying file system.)
I attempted a force-merge but that did not seem to help. I also verified that both indices are set to writable, but no new data has been added to them.
I will wait another day to see whether they get smaller.