We changed our mapping, and this impacts the store size of our indices; I am trying to figure out how much the size increases.
I am not sure I understand the output precisely.
Are the reported sizes the size of the data or the size on disk?
Is the primary store size the mean size of one primary or the size of all primaries?
I guess that the primary store size is the size of all primaries, and that the store size should be equal to (1 + number of replicas) * (primary store size). Is that right?
In fact this is not quite the case: the store size is almost equal to (1 + number of replicas) * (primary store size), but not exactly.
Last point: to evaluate the mean size of one document, is it possible to compute (primary store size) / (number of docs + number of deleted docs)?
I understood that deleted docs are still in the indices but no longer accessible.
You should see two columns: store.size and pri.store.size. store.size is the total size (primaries and replicas), whereas pri.store.size is the size of primaries alone. This is storage (disk) size.
Notice that you can get help on the column headers using:
GET _cat/indices?help
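For example, a request along these lines returns just the relevant columns, with sizes reported as raw byte counts (the index name is only a placeholder):

GET _cat/indices/my-index?v&h=index,pri,rep,docs.count,docs.deleted,pri.store.size,store.size&bytes=b

With bytes=b you can compare store.size against (1 + rep) * pri.store.size directly, without rounding from the human-readable units.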
Notice that replicas and primaries are not necessarily identically sized, so it is expected that your formula does not give the exact number. There are multiple reasons for this: each replica/primary in many ways works independently, maintaining its own Lucene index for the shard. For instance, indexing does not necessarily happen in the same order, and merging may kick in at different times.
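You can see this for yourself by listing the individual shard copies; primary (p) and replica (r) rows for the same shard number will typically report slightly different store values (again, the index name is just a placeholder):

GET _cat/shards/my-index?v&h=index,shard,prirep,docs,store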
I am not sure what exactly you want out of the average size calculation. You can compute the average as you describe, and it will give you the average size over current docs plus deletions. Notice that deletions occur for two reasons: updates and deletes. If you want to use this for forecasting storage use, it might make sense to use the more conservative (primary store size) / (number of docs) instead, in order not to undershoot the target, but it does depend on whether you expect many updates/deletes or not.
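As a sketch with purely made-up numbers, assuming pri.store.size is read in bytes:

pri.store.size = 10,000,000,000 bytes
docs.count     = 20,000,000
docs.deleted   =  2,000,000

average including deletions: 10,000,000,000 / (20,000,000 + 2,000,000) ≈ 455 bytes/doc
conservative average:        10,000,000,000 / 20,000,000 = 500 bytes/doc

The conservative figure is the one that errs on the side of overestimating storage per document.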
The way Lucene stores docs is complicated, and I think trying to compare one data set with the old mappings and deletions in it against another data set with the new mappings and no deletions will be hard and likely error-prone.
I think a better approach is to pick a relevant subset of data, index it into an index using the old mapping, and index the same subset into another index using the new mapping. Then compare the sizes of those two indices. The data set has to have a "good" size, i.e. not be unrealistically small, but also not so large that the exercise takes too long. You could let your target index have only one primary shard and target 50GB for the experiment. You can use reindex to copy the data, as in the sketch below.
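A minimal sketch of that experiment, assuming hypothetical index names (test-old-mapping / test-new-mapping / my-source-index) and leaving out the actual mappings, which you would supply when creating each index; max_docs requires a reasonably recent Elasticsearch version:

PUT test-old-mapping
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}

POST _reindex
{
  "max_docs": 1000000,
  "source": { "index": "my-source-index" },
  "dest": { "index": "test-old-mapping" }
}

Repeat the same steps into test-new-mapping with the new mappings applied, then compare the two:

GET _cat/indices/test-old-mapping,test-new-mapping?v&h=index,docs.count,pri.store.size&bytes=b

Setting number_of_replicas to 0 keeps the comparison to primary storage only, which is what you care about here.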