Hello,
I have some questions about the mapper-size plugin and how the size of the documents' source relates to the size of the index.
I installed the mapper-size plugin on a single-node Elasticsearch cluster, indexed some sample data (logs), and computed the average size of the document _source using this aggregation:
GET sample-logs/_search
{
  "aggs": {
    "avg_source_size": {
      "avg": {
        "field": "_size"
      }
    }
  }
}
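(Side note: installing the plugin alone is not enough; the _size metadata field also has to be enabled in the index mapping. Roughly, that step looks like this for my test index:)

PUT sample-logs
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}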
Now I wanted to compare this result to the size of the index (the size of the primary shards).
Notes:
- I'm using Elasticsearch v7.2.0
- my test index has only 1 shard and 0 replicas.
- apart from enabling _size, I'm not declaring any mapping; I let Elasticsearch automatically detect new fields and field types and assign the mapping dynamically (so, for example, all string values in the source are indexed as text fields with a keyword multifield; see the sketch after these notes)
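To illustrate what the dynamic mapping produces, a string field (say, a hypothetical message field) ends up mapped like this:

"message": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}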
First I used a test index with around 1.5 million documents and a primary store size of around 1 GB.
With the Cat Indices API (or the Index Stats API) I can retrieve the number of documents and the size of the index:
GET _cat/indices/sample-logs?v&h=index,docs.count,pri.store.size
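The output looks like this (illustrative values, not my exact numbers):

index        docs.count pri.store.size
sample-logs     1500000            1gb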
With this information I computed the index size per document:
doc_size = pri.store.size / docs.count
This should be the space that a document in the index occupies on disk, on average.
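As a rough illustration with the first test index: 1 GB / 1.5 million docs ≈ 700 bytes per document on disk (for exact arithmetic, the cat API can return the size in bytes with the bytes=b parameter).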
What I found is that the size of the index per document is smaller than the average doc source size.
I repeated this test multiple times with different index sizes and document counts (index sizes ranging from 150 MB to 15 GB, document counts ranging from 400 K to 30 million), and I consistently observed the same result.
In fact, the gap between the average source size and the index size per document grows as the volume of data increases.
On average, I observed that the index size per document is around 65% smaller than the average doc source size.
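Put differently (with illustrative numbers): if the average _source size is around 700 bytes, the index ends up using only about 0.35 × 700 ≈ 245 bytes per document on disk.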
But I expected a different result: I thought the size of the index would generally be bigger than the total size of the raw data, since Elasticsearch does not just store the _source but also needs to create the files that make up the index data structures.
Or am I missing something?
Do I need to take into consideration a compression factor?
Does the store size returned by the Cat Indices API take into account all the contributions to the space that an index occupies on disk?