Hello All,
I am using the bulk API to index data. I ran two instances of my Java program, which posts data to the _bulk endpoint.
In the first case, the average document size was 1550 bytes and I let the program run for 1 hour. At the end, I could see around 22 million documents created.
In the second case, the average document size was 550 bytes, so I expected around 3 times as many documents to be indexed, but it turned out to be just less than double: only 42 million in 1 hour. In both cases I had a 4-node cluster, but the index was created with only 1 shard and no replicas, and the shard was forced onto a particular node (using IP-based allocation filtering).
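For reference, a minimal sketch of how that index setup can be done, assuming the ES 1.x Java TransportClient; the index name and node IP are placeholders:

```java
import org.elasticsearch.client.Client;

public class CreatePinnedIndex {
    // Create an index with 1 shard and no replicas, and pin the shard to one
    // node via IP-based shard allocation filtering.
    // "myindex" and the IP address are placeholders, not the real values.
    public static void create(Client client) {
        String settings = "{"
            + "\"index.number_of_shards\": 1,"
            + "\"index.number_of_replicas\": 0,"
            + "\"index.routing.allocation.require._ip\": \"10.0.0.1\""
            + "}";
        client.admin().indices().prepareCreate("myindex")
              .setSettings(settings)
              .execute().actionGet();
    }
}
```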
In the first case, each document had 60 attributes, most of them not_analyzed strings.
In the second case, each document had 19 attributes, most of them not_analyzed strings.
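The mappings look roughly like this in both cases, only with different attribute counts (a sketch; index, type, and field names are placeholders):

```java
import org.elasticsearch.client.Client;

public class PutNotAnalyzedMapping {
    // Map string fields as not_analyzed, i.e. indexed as single exact terms
    // with no tokenization. Index, type, and field names are placeholders.
    public static void put(Client client) {
        String mapping = "{"
            + "\"mytype\": { \"properties\": {"
            + "  \"field1\": { \"type\": \"string\", \"index\": \"not_analyzed\" },"
            + "  \"field2\": { \"type\": \"string\", \"index\": \"not_analyzed\" }"
            + "}}}";
        client.admin().indices().preparePutMapping("myindex")
              .setType("mytype")
              .setSource(mapping)
              .execute().actionGet();
    }
}
```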
If everything else remains the same, why is the number of documents indexed per hour not inversely proportional to document size? Thanks for any explanation.
It might be that the fields you are indexing in the second case are among the most costly fields to index. Moreover, indexing has fixed per-document costs (request parsing, routing, per-document bookkeeping) that prevent it from being 3x faster when documents are 3x smaller.
Since your documents are 3x smaller, something you could look into would be making the bulk size 3x larger (in number of documents), so that each bulk request carries roughly the same number of bytes. This will help Elasticsearch call fsync less often.
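If you are building the bulk requests by hand, the BulkProcessor helper in the Java client can enforce this for you. A minimal sketch; the flush thresholds here are illustrative, not tuned values:

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

public class BulkIndexer {
    // Build a BulkProcessor that flushes either after a given number of
    // documents or after a given payload size, whichever comes first.
    // Smaller documents are then automatically packed into larger bulks.
    public static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {}

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        })
        .setBulkActions(15000)                               // flush after this many documents
        .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // or after ~10 MB of payload
        .build();
    }
}
```

Flushing on bytes rather than only on document count keeps the request size stable even when document sizes vary.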
Meanwhile I did another test: I made all my string fields non-searchable with the "index": "no" option. I have one IP field and 2 date fields which keep their default index setting. In this case it indexed 53 million documents in 1 hour; the document size was the same as in case 2 above.
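The only change from the earlier mapping sketch was the mapping string, roughly like this (field names are placeholders):

```java
// String fields switched to "index": "no", so they are stored in _source
// but not searchable; the ip and date fields keep their default index
// setting. Field names are placeholders.
String mapping = "{"
    + "\"mytype\": { \"properties\": {"
    + "  \"field1\":   { \"type\": \"string\", \"index\": \"no\" },"
    + "  \"clientIp\": { \"type\": \"ip\" },"
    + "  \"created\":  { \"type\": \"date\" }"
    + "}}}";
```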