I am using the bulk API to index data. I ran two instances of my Java program, which posts data to the _bulk URL.
In the first case, my average document size was 1550 bytes and I let the program run for 1 hour. At the end, around 22 million documents had been created.
In the second case, the average document size was 550 bytes, so I expected it to index around 3 times as many documents, but it turned out to be just under double: only 42 million in 1 hour. In both cases I had a 4-node cluster, but the index was created with only 1 shard, no replicas, and the shard was forced onto a particular node (using IP configuration).
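For reference, pinning a single-shard index to one node as described above is typically done with shard allocation filtering; a sketch of the index settings, with a placeholder IP address:

```
PUT /myindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.routing.allocation.require._ip": "10.0.0.1"
  }
}
```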
In the first case, there were 60 attributes in each document, most of them not_analyzed strings.
In the second case, there were 19 attributes in each document, most of them not_analyzed strings.
If everything else remains the same, why is indexing throughput not directly proportional to document size? Thanks for any explanation.
There are many parameters that have a direct impact on the index size, mainly:
- the number of documents
- field types (doubles are more space-intensive than floats for instance)
- field cardinality
- field sparsity (number of docs that have the field vs number of docs in the index)
The attributes of my second-case documents are a subset of the first-case attributes.
It might be that the fields you are indexing in the second case are among the most costly fields to index. Moreover, indexing has some fixed per-document and per-request costs that prevent you from indexing 3x faster when documents are 3x smaller.
Something you could look into, since the documents are 3x smaller, would be to make the bulk size 3x larger, keeping the payload per request roughly constant. This will help Elasticsearch call fsync less often.
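The idea above is just to batch more documents per bulk request when each document is smaller. A minimal sketch of the batching logic on the client side (this is not the Elasticsearch client API, just plain Java; how you send each batch is up to your program):

```java
import java.util.ArrayList;
import java.util.List;

public class BulkBatcher {
    // Split the documents into batches of at most batchSize.
    // With 550-byte docs instead of 1550-byte docs, roughly tripling
    // batchSize keeps the bulk payload size about the same.
    static List<List<String>> batchByCount(List<String> docs, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("{\"id\":" + i + "}");
        // 10 docs in batches of 3 -> 4 batches (3, 3, 3, 1)
        System.out.println(batchByCount(docs, 3).size());
    }
}
```

Each inner list would then become the body of one _bulk request.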
ok... thanks. I will try with a 3x bulk size.
Another test I did in the meantime: I made all my string fields non-searchable with the "index": "no" option. I have one IP field and 2 date fields, which keep their default index setting. In this case it indexed 53 million documents in 1 hour. Document size was the same as in case 2 above.
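For anyone trying to reproduce this, the mapping change amounts to something like the following (field names here are placeholders; "index": "no" is the pre-2.x string syntax that matches the not_analyzed usage in this thread):

```
PUT /myindex/_mapping/mytype
{
  "properties": {
    "name":      { "type": "string", "index": "no" },
    "client_ip": { "type": "ip" },
    "created":   { "type": "date" }
  }
}
```

Fields with "index": "no" are stored in _source but never added to the inverted index, which removes their analysis and indexing cost entirely.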