Comparison between index size and doc source size

Hello,

I have some questions regarding the mapper size plugin and how the size of the documents' source relates to the size of the index.

So, I installed the mapper size plugin on a single-node Elasticsearch cluster, indexed some sample data (logs), and computed the average size of the documents' source using this aggregation:

GET sample-logs/_search
{
	"aggs": {
		"avg_source_size": {
			"avg": {
				"field": "_size"
			}
		}
	}
}
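
For reference, the _size field from the mapper size plugin has to be enabled explicitly in the index mapping; a minimal sketch of how that can be done (all other fields are still mapped dynamically):

PUT sample-logs
{
	"mappings": {
		"_size": {
			"enabled": true
		}
	}
}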

Now I wanted to compare this result to the size of the index (the size of the primary shards).

Notes:

  • I'm using Elasticsearch v7.2.0
  • my test index has only 1 shard and 0 replicas.
  • I'm not using any declared mapping; I let Elasticsearch automatically detect new fields and field types and assign mappings dynamically (so, for example, all string values in the source are indexed as text fields with a keyword multi-field)

First I used a test index with around 1.5 million documents and a primary store size of around 1 GB.
With the Cat Indices API (or the Index Stats API) I can retrieve the number of documents and the size of the index:

GET _cat/indices/sample-logs?v&h=index,docs.count,pri.store.size
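
The Index Stats API returns the same numbers, in case that is easier to work with; a sketch using the same index:

GET sample-logs/_stats/docs,store

The relevant fields in the response are primaries.docs.count and primaries.store.size_in_bytes.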

With this information I computed the index size per document:

doc_size = pri.store.size/docs.count

This should be the space that a document in the index occupies on disk, on average.
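
For example, with the rough numbers above this works out to something like:

doc_size ≈ 1 GB / 1.5M docs ≈ 1,073,741,824 bytes / 1,500,000 ≈ 700 bytes per document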

What I found is that the size of the index per document is smaller than the average doc source size.

I repeated this test multiple times with different index sizes and document counts (index sizes ranging from 150 MB to 15 GB, document counts ranging from 400K to 30 million), and I consistently observed the same result.
In fact, the difference between the average source size and the index size per document grows as the volume of data increases.

On average, I observed that the index size per document is around 65% smaller than the average doc source size.

But I expected a different result: I thought the index size would generally be bigger than the total size of the raw data, since Elasticsearch does not just store the source but also has to create the files that make up the index data structures.

Or am I missing something?
Do I need to take into consideration a compression factor?
Does the store size returned by the Cat Indices API take into account all the contributions to the volume that an index occupies on disk?
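
In case it helps, this is a sketch of how I think the store size could be broken down per segment, using the Cat Segments API on the same index:

GET _cat/segments/sample-logs?v&h=segment,docs.count,size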

Elasticsearch does apply compression internally, so the size of the source and the indexed data can be either larger or smaller than the raw data. It primarily depends on the data and the mappings used. Is this realistic data? Do the fields have a realistic distribution, or are you just indexing the same documents over and over again?
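
As a side note, the compression used for stored fields such as _source can be tuned per index through the index.codec setting at index creation time; a minimal sketch with a made-up index name:

PUT sample-logs-v2
{
	"settings": {
		"index.codec": "best_compression"
	}
}

best_compression trades slower stored-field access for a better compression ratio than the default LZ4-based codec.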

Hi @Christian_Dahlqvist ,

Yes, I'm using realistic data, in particular a sample of logs from Kubernetes pods in a production environment.
The source is something like this:

{
	"kubernetes": {
		"container_name": "...",
		"host": "...",
		"namespace_labels": {
			"name": "..."
		},
		"namespace_name": "...",
		"pod_name": "...",
		"labels": {
			"app": "...",
			"deployment": "...",
			...
		}
	},
	"cluster_name": "...",
	"environment": "...",
	"@timestamp": "...",
	"level": "...",
	"message": "<ACTUAL LOG MESSAGE>"
}

The log message can be very large in some cases, spanning many lines.

So there is no known, fixed compression factor?
I guess whether the indexed data gets compressed or expanded with respect to the raw version depends heavily on its format and on the mapping you choose.
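
For example, I imagine an explicit mapping that indexes the big message field only as text, without the keyword multi-field that dynamic mapping adds, would already change the numbers; a hypothetical sketch with made-up index and field choices:

PUT sample-logs-explicit
{
	"mappings": {
		"properties": {
			"message": {
				"type": "text"
			},
			"level": {
				"type": "keyword"
			}
		}
	}
}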

Elasticsearch is flexible and allows you to index data in a number of ways. How much space data takes up on disk compared to the raw size can therefore vary widely. The docs contain some pointers, and there are also some blog posts, although most of the ones I have seen are quite old.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.