Hello,
I have some questions about the mapper-size plugin and how the size of the documents' source relates to the size of the index.
I installed the mapper-size plugin on a single-node Elasticsearch cluster, indexed some sample data (logs), and computed the average size of the document _source using this aggregation:
GET sample-logs/_search
{
  "aggs": {
    "avg_source_size": {
      "avg": {
        "field": "_size"
      }
    }
  }
}
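(Side note: installing the plugin alone is not enough; the _size metadata field also has to be enabled in the index mapping. Roughly, that step looks like this for my test index:)

PUT sample-logs
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}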
Now I wanted to compare this result to the size of the index (the size of the primary shards).
Notes:
- I'm using Elasticsearch v7.2.0
- my test index has only 1 shard and 0 replicas.
- apart from enabling _size, I'm not declaring any mapping; I let Elasticsearch automatically detect new fields and field types and assign the mapping dynamically (so, for example, all string values in the source are indexed as text fields with a keyword multifield; see the sketch after these notes)
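To illustrate what the dynamic mapping produces, a string field (say, a hypothetical message field) ends up mapped like this:

"message": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}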
First I used a test index with around 1.5 million documents and a primary store size of around 1 GB.
With the Cat Indices API (or the Index Stats API) I can retrieve the number of documents and the size of the index:
GET _cat/indices/sample-logs?v&h=index,docs.count,pri.store.size
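The output looks like this (illustrative values, not my exact numbers):

index        docs.count pri.store.size
sample-logs     1500000            1gb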
With this information I computed the index size per document:
doc_size = pri.store.size / docs.count
This should be the space that a document in the index occupies on disk, on average.
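As a rough illustration with the first test index: 1 GB / 1.5 million docs ≈ 700 bytes per document on disk (for exact arithmetic, the cat API can return the size in bytes with the bytes=b parameter).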
What I found is that the size of the index per document is smaller than the average doc source size.
I repeated this test multiple times with different index sizes and document counts (index sizes ranging from 150 MB to 15 GB, document counts ranging from 400 K to 30 million), and I consistently observed the same result.
In fact, the gap between the average source size and the index size per document grows as the volume of data increases.
On average, I observed that the index size per document is around 65% smaller than the average doc source size.
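Put differently (with illustrative numbers): if the average _source size is around 700 bytes, the index ends up using only about 0.35 × 700 ≈ 245 bytes per document on disk.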
But I expected a different result: I thought the size of the index would generally be bigger than the total size of the raw data, since Elasticsearch does not just store the _source but also needs to create the files that make up the index data structures.
Or am I missing something?
Do I need to take into consideration a compression factor?
Does the store size returned by the Cat Indices API take into account all the contributions to the space that an index occupies on disk?