Any words of wisdom for bulk indexing that starts to slow down significantly as the amount of data increases? The index I'm creating (starting from zero documents) has about 122 million documents in total, spread across 12 shards (1 replica) on three physical nodes (64 GB RAM + SSD each). Size on disk is 231 GB (without replicas), so on average one document consumes about 2 KB.
There's no significant GC going on, and in the beginning (before the index grows to about 13M documents) the indexing rate is good: about 5,600 docs/sec. Then the speed starts to drop; by 18-19M documents it is down to 435 docs/sec.
The indexing is done from a single server that round-robins bulk requests (batches of 500 documents) across the nodes in the cluster. I tried disabling the refresh interval, but it doesn't seem to have much effect.
Any tips? I'd assume Elasticsearch should be capable of bulk indexing clearly faster than 400 documents per second (as it does initially).
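In case it helps, a minimal sketch of the kind of loading loop I mean, using the Python client (the ports, the gen_docs() generator and the refresh restore value are placeholders, and the exact put_settings keyword arguments differ slightly between client versions):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# The client round-robins requests across the listed hosts.
es = Elasticsearch([
    "http://es-dev-node0:9200",
    "http://es-dev-node1:9200",
    "http://es-dev-node2:9200",
])

# Disable refresh while bulk loading; restore it afterwards.
es.indices.put_settings(index="bulk_test", body={"index": {"refresh_interval": "-1"}})

def actions(docs):
    for doc in docs:
        yield {"_index": "bulk_test", "_source": doc}

# gen_docs() is a hypothetical generator yielding the ~122M source documents.
bulk(es, actions(gen_docs()), chunk_size=500)

es.indices.put_settings(index="bulk_test", body={"index": {"refresh_interval": "1s"}})
```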
I assume you are specifying document IDs yourself before sending data to Elasticsearch instead of letting Elasticsearch assign them automatically. If that is the case, it is expected that indexing throughput drops over time as the shards grow in size, because Elasticsearch has to treat each insert as a potential update. That means a read is required for every write, which gets slower the more data you have.
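To illustrate the difference, here are the two bulk-action shapes in Python client syntax (the "my_id" field is hypothetical):

```python
def action_with_explicit_id(doc):
    # Supplying _id means every document is a potential update, so the shard
    # has to look the ID up first; this gets slower as the shard grows.
    return {"_index": "bulk_test", "_id": doc["my_id"], "_source": doc}

def action_with_auto_id(doc):
    # No _id supplied: Elasticsearch auto-generates one and can append the
    # document without the lookup, so throughput stays flat for much longer.
    return {"_index": "bulk_test", "_source": doc}
```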
bulk_test 3 p STARTED 1597660 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 3 r STARTED 1594132 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 7 r STARTED 1602340 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 7 p STARTED 1658428 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 4 r STARTED 1526284 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 4 p STARTED 1540368 2.6gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 5 p STARTED 1663810 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 5 r STARTED 1601404 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 1 r STARTED 1542631 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 1 p STARTED 1589674 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 6 p STARTED 1586528 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 6 r STARTED 1541061 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 2 r STARTED 1742290 2.8gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 2 p STARTED 1527647 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 9 p STARTED 1539265 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 9 r STARTED 1736065 2.8gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 8 r STARTED 1563906 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 8 p STARTED 1640588 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 r STARTED 1599626 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 p STARTED 1628775 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 11 p STARTED 1587225 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 11 r STARTED 1622362 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 0 p STARTED 1650475 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 0 r STARTED 1557574 2.7gb xxx.xxx.xxx.93 es-dev-node2
I'm aware that dropping replicas might give some boost, but since the speed is dropping exponentially, that shouldn't be the root cause.
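(For completeness, dropping replicas for the duration of the load is just a settings toggle; a sketch with the Python client, restoring 1 replica afterwards:)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-dev-node0:9200")  # any node works for settings calls

# Drop replicas while bulk loading...
es.indices.put_settings(index="bulk_test",
                        body={"index": {"number_of_replicas": 0}})
# ... run the bulk load, then restore the replica.
es.indices.put_settings(index="bulk_test",
                        body={"index": {"number_of_replicas": 1}})
```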
Phew, I think I finally found the root cause of the indexing slowdown: ICU4J transliteration, combined with the fact that after about 13.2M documents we start to hit a lot of Chinese data.
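For anyone hitting the same thing: if the transliteration happens at index time via the analysis-icu plugin (which wraps ICU4J), the costly part is an icu_transform token filter in the analyzer. A hypothetical index definition along those lines (field, filter and analyzer names are made up; only the shard count matches my setup, and transliterating Chinese text to Latin is far more work per token than passing Latin text through):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-dev-node0:9200")

es.indices.create(
    index="bulk_test",
    body={
        "settings": {
            "number_of_shards": 12,
            "analysis": {
                "filter": {
                    # icu_transform comes from the analysis-icu plugin and runs
                    # ICU4J transliteration on every token at index time.
                    "latin_transform": {
                        "type": "icu_transform",
                        "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC",
                    }
                },
                "analyzer": {
                    "transliterated": {
                        "tokenizer": "icu_tokenizer",
                        "filter": ["latin_transform", "lowercase"],
                    }
                },
            },
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "transliterated"}
            }
        },
    },
)
```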