I'm not sure this is totally accurate as I'm oversimplifying, but let me tell you a bit about what could happen behind the scenes. Let's say that you have an index with the following terms appearing in the following documents:
azertyuiop: A, B, C, D, E
qsdfghjklm: A, E
wxcvbn: B, C, E
If you have one shard, that single shard will have the following data structure on disk:
azertyuiop: A, B, C, D, E
qsdfghjklm: A, E
wxcvbn: B, C, E
If you have 5 shards and, let's say, A goes to shard 0, B to shard 1, ..., E to shard 4, you will have the following data structure on disk:
Shard 0:
azertyuiop: A
qsdfghjklm: A
Shard 1:
azertyuiop: B
wxcvbn: B
Shard 2:
azertyuiop: C
wxcvbn: C
Shard 3:
azertyuiop: D
Shard 4:
azertyuiop: E
qsdfghjklm: E
wxcvbn: E
As you can see, the same index entries are duplicated multiple times. That explains why the size won't be exactly divided by 5.
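To make the duplication concrete, here is a toy sketch in Python (a plain dict, not Lucene's actual on-disk format) that builds the inverted index above once for a single shard and once per shard, then counts the term entries in each case:

```python
# Toy model of the term dictionary duplication described above: each shard
# keeps its own inverted index, so a term whose documents are routed to
# several shards is stored once per shard.
docs = {
    "A": ["azertyuiop", "qsdfghjklm"],
    "B": ["azertyuiop", "wxcvbn"],
    "C": ["azertyuiop", "wxcvbn"],
    "D": ["azertyuiop"],
    "E": ["azertyuiop", "qsdfghjklm", "wxcvbn"],
}

def build_index(doc_ids):
    """Build an inverted index (term -> list of doc ids) for the given docs."""
    index = {}
    for doc_id in doc_ids:
        for term in docs[doc_id]:
            index.setdefault(term, []).append(doc_id)
    return index

# One shard: each term is stored exactly once.
single_shard = build_index(docs)
print(len(single_shard))  # 3 term entries

# Five shards (A -> 0, B -> 1, ..., E -> 4): the same terms reappear
# in every shard that holds a matching document.
shards = [build_index([doc_id]) for doc_id in docs]
print(sum(len(shard) for shard in shards))  # 10 term entries in total
```

With one shard the term dictionary has 3 entries; split across 5 shards it has 10, which is why the total size is larger than one fifth per shard.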
But there are a few other things to take into account as well. Do you run _forcemerge to reduce the number of segments to 1 before measuring the size? This operation reduces the space used, as each segment contains the list of terms used in that segment; merging down to 1 segment would help. Also, if you are not only appending data but also updating or deleting, using the Force Merge API helps a lot.
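For reference, a minimal sketch of calling the Force Merge API with the Python client; the index name my-index and the local cluster URL are just assumptions for the example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Merge all segments of the index down to a single segment
# before measuring its size on disk.
es.indices.forcemerge(index="my-index", max_num_segments=1)
```

The equivalent REST call is POST my-index/_forcemerge?max_num_segments=1.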