I am using the bulk API to index data. I ran two instances of my Java program, which posts data to the _bulk URL.
In the first case, my average document size was 1550 bytes and I let the program run for 1 hour. At the end, around 22 million documents had been created.
In the second case, the average document size was 550 bytes, so I expected it to index around 3 times as many documents, but it turned out to be just under double: only 42 million in 1 hour. In both cases I had a 4-node cluster, but the index was created with only 1 shard, no replicas, and the shard was forced onto a particular node (using IP configuration).
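For reference, pinning a single-shard index to one node as described above is typically done with shard allocation filtering; a sketch of the index settings, with a placeholder IP address:

```
PUT /myindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.routing.allocation.require._ip": "10.0.0.1"
  }
}
```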
In the first case, there were 60 attributes in each document, most of them not_analyzed strings.
In the second case, there were 19 attributes in each document, most of them not_analyzed strings.
If everything else remains the same, why is indexing throughput not directly proportional to document size? Thanks for any explanation.
There are many parameters that have a direct impact on the index size, mainly:
- the number of documents
- field types (doubles are more space-intensive than floats for instance)
- field cardinality
- field sparsity (number of docs that have the field vs number of docs in the index)
The attributes of my second-case documents are a subset of the first-case attributes.
It might be that the fields you are indexing in the second case are among the most costly fields to index. Moreover, indexing has some fixed per-document and per-request costs that prevent you from indexing 3x faster when documents are 3x smaller.
Something you could look into, since the documents are 3x smaller, would be to make the bulk size 3x larger, keeping the payload per request roughly constant. This will help Elasticsearch call fsync less often.
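The idea above is just to batch more documents per bulk request when each document is smaller. A minimal sketch of the batching logic on the client side (this is not the Elasticsearch client API, just plain Java; how you send each batch is up to your program):

```java
import java.util.ArrayList;
import java.util.List;

public class BulkBatcher {
    // Split the documents into batches of at most batchSize.
    // With 550-byte docs instead of 1550-byte docs, roughly tripling
    // batchSize keeps the bulk payload size about the same.
    static List<List<String>> batchByCount(List<String> docs, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("{\"id\":" + i + "}");
        // 10 docs in batches of 3 -> 4 batches (3, 3, 3, 1)
        System.out.println(batchByCount(docs, 3).size());
    }
}
```

Each inner list would then become the body of one _bulk request.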
ok... thanks. I will try with a 3x bulk size.
Another test I did in the meantime: I made all my string fields non-searchable with the "index": "no" option. I have one IP field and 2 date fields, which keep their default index setting. In this case it indexed 53 million documents in 1 hour. Document size was the same as in case 2 above.
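For anyone trying to reproduce this, the mapping change amounts to something like the following (field names here are placeholders; "index": "no" is the pre-2.x string syntax that matches the not_analyzed usage in this thread):

```
PUT /myindex/_mapping/mytype
{
  "properties": {
    "name":      { "type": "string", "index": "no" },
    "client_ip": { "type": "ip" },
    "created":   { "type": "date" }
  }
}
```

Fields with "index": "no" are stored in _source but never added to the inverted index, which removes their analysis and indexing cost entirely.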