This was discussed in this webinar and this blog post. It is a way to reason about how raw data turns into indexed data on disk. We often first convert the raw data into JSON documents, and how much the size changes at this step depends on how we parse and structure the data and on how much enrichment data we add. This ratio is what we often refer to as the JSON conversion factor, and it can vary a lot between different types of data.
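As a rough illustration, the JSON conversion factor can be measured on a sample by comparing the byte size of the raw lines with the byte size of the JSON documents produced from them. This is only a sketch: the log format, the `parse_line` helper and the sample lines are made up for the example, and a real pipeline would use its own parsing and enrichment.

```python
import json


def json_conversion_factor(raw_lines, to_document):
    """Return serialized JSON size divided by raw size, both in UTF-8 bytes.

    raw_lines must be a list (it is iterated twice).
    """
    raw_bytes = sum(len(line.encode("utf-8")) for line in raw_lines)
    json_bytes = sum(
        len(json.dumps(to_document(line), separators=(",", ":")).encode("utf-8"))
        for line in raw_lines
    )
    return json_bytes / raw_bytes


# Hypothetical parser: splits a "timestamp level message" line into fields.
def parse_line(line):
    timestamp, level, message = line.split(" ", 2)
    return {"@timestamp": timestamp, "level": level, "message": message}


# Made-up sample data, only for illustration.
sample = [
    "2024-01-01T00:00:00Z INFO service started",
    "2024-01-01T00:00:05Z WARN disk usage at 91%",
]

factor = json_conversion_factor(sample, parse_line)
print(f"JSON conversion factor: {factor:.2f}")
```

Here the factor comes out above 1 because the field names and JSON punctuation add bytes. With heavy enrichment it would grow further, and if fields are dropped during parsing it can fall below 1.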
Once we index these documents into Elasticsearch, the size changes again. This typically depends on the data itself, the mappings used, and the index settings and shard size. This ratio is what we refer to as the index conversion factor.
To estimate the size of the primary shards on disk, we take the raw data volume and multiply it by these two factors. This approach was picked because it is relatively easy to test and to use in benchmarks.
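The multiplication described above can be sketched as a small helper. The function name and the numbers (100 GB of raw data, factors of 1.2 and 0.8) are assumptions chosen purely for illustration, not measured values. Real factors should come from testing with your own data and mappings.

```python
def estimate_primary_size_gb(raw_gb, json_factor, index_factor):
    """Estimate primary shard size on disk from the raw data volume.

    raw_gb:       raw data volume in GB
    json_factor:  JSON size / raw size
    index_factor: indexed size on disk / JSON size
    """
    if min(raw_gb, json_factor, index_factor) < 0:
        raise ValueError("volume and factors must be non-negative")
    return raw_gb * json_factor * index_factor


# Illustrative values only.
estimate = estimate_primary_size_gb(raw_gb=100, json_factor=1.2, index_factor=0.8)
print(f"Estimated primary shard size: {estimate:.1f} GB")
```

Note that this covers primary shards only. Replicas multiply the total on top of this figure.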