Bulk index slowing down as index size increases

Any words of wisdom for bulk indexing that starts to slow down significantly as the amount of data increases? The index I'm creating (starting from zero documents) has about 122 million documents in total, spread across 12 shards (1 replica) on three physical nodes (64 GB RAM + SSD each). Size on disk is 231 GB (without replicas), so on average one document consumes about 2 KB.

There's no significant GC going on. And in the beginning (before the index grows to about 13M documents) the indexing rate is nice: about 5,600 docs/sec. However, then the speed starts to drop. By 18M to 19M documents it has fallen to 435 docs/sec.

The indexing is done from one server, round-robining bulk requests (batches of 500 documents) across the nodes in the cluster. I tried disabling the refresh interval, but it didn't seem to have much effect.
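For reference, disabling refresh during the load is a dynamic index settings update along these lines (a sketch; `bulk_test` stands in for the actual index name):

```json
PUT /bulk_test/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}
```

It can be set back afterwards (e.g. to `"1s"`, or `null` to reset to the default) once the bulk load finishes.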

Any tips? I assume Elasticsearch should be capable of bulk indexing clearly faster than 400 documents/second, as the initial speed shows.

Elastic HQ diagnostics as a reference

Could you provide the output of `_cat/shards/YOURINDEXNAME?v`?

That might give an indication of whether documents are being distributed evenly or not.

I assume you are specifying document IDs before sending data to Elasticsearch instead of letting Elasticsearch assign them automatically. If that is correct, it is expected that indexing throughput drops over time as the shards grow in size, because Elasticsearch must treat each insert as a potential update. This means a read is required for every write, and that slows down the more data you have.
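To make the difference concrete, here is a minimal Python sketch (the helper name is mine, not from any client library) of the two bulk body variants; the only difference is whether each action line carries an explicit `_id`:

```python
import json

def bulk_body(docs, index, ids=None):
    """Build an NDJSON body for the _bulk API.

    With ids given, each action line carries an explicit _id, so
    Elasticsearch must check whether that ID already exists (a read
    per write). Without ids, Elasticsearch auto-generates the IDs
    and can skip that existence check.
    """
    lines = []
    for i, doc in enumerate(docs):
        action = {"index": {"_index": index}}
        if ids is not None:
            action["index"]["_id"] = ids[i]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # A _bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"

docs = [{"msg": "hello"}, {"msg": "world"}]
print(bulk_body(docs, "bulk_test"))              # auto-assigned IDs
print(bulk_body(docs, "bulk_test", ids=[1, 2]))  # explicit IDs: read per write
```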

bulk_test 3  p STARTED 1597660 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 3  r STARTED 1594132 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 7  r STARTED 1602340 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 7  p STARTED 1658428 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 4  r STARTED 1526284 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 4  p STARTED 1540368 2.6gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 5  p STARTED 1663810 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 5  r STARTED 1601404 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 1  r STARTED 1542631 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 1  p STARTED 1589674 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 6  p STARTED 1586528 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 6  r STARTED 1541061 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 2  r STARTED 1742290 2.8gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 2  p STARTED 1527647 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 9  p STARTED 1539265 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 9  r STARTED 1736065 2.8gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 8  r STARTED 1563906 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 8  p STARTED 1640588 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 r STARTED 1599626 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 p STARTED 1628775 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 11 p STARTED 1587225 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 11 r STARTED 1622362 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 0  p STARTED 1650475 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 0  r STARTED 1557574 2.7gb xxx.xxx.xxx.93 es-dev-node2

I'm aware that dropping replicas might give some boost, but since the speed is dropping exponentially, that should not be the root cause.

You are correct. Thanks for the info! If I switch the Bulk API action from `index` to `create`, should that result in a significant improvement?

Only if you let Elasticsearch create the IDs. If you assign the IDs yourself, it still needs to check whether the document already exists.

I tried the same thing but let Elasticsearch create the IDs, yet a similar slow-down still happens.

Is there anything in the logs, e.g. long or frequent GC, or perhaps slow merging? What does the cluster stats API give?

The more documents there are, the more memory the inverted index uses, which may cause more GC.

Tail of gc.log

See https://pastebin.com/m4URrs5T

Grep for `Real`:

[2020-05-27T08:29:19.739+0000][5748][gc,cpu       ] GC(27488) User=0.34s Sys=0.00s Real=0.04s
[2020-05-27T08:29:19.888+0000][5748][gc,cpu       ] GC(27489) User=0.03s Sys=0.01s Real=0.01s
[2020-05-27T08:29:19.992+0000][5748][gc,cpu       ] GC(27489) User=0.00s Sys=0.00s Real=0.00s
[2020-05-27T08:30:34.145+0000][5748][gc,cpu       ] GC(27490) User=0.32s Sys=0.02s Real=0.03s
[2020-05-27T08:31:17.405+0000][5748][gc,cpu       ] GC(27491) User=0.35s Sys=0.00s Real=0.03s
[2020-05-27T08:31:17.556+0000][5748][gc,cpu       ] GC(27492) User=0.03s Sys=0.00s Real=0.00s
[2020-05-27T08:31:17.662+0000][5748][gc,cpu       ] GC(27492) User=0.00s Sys=0.00s Real=0.00s
[2020-05-27T08:31:17.791+0000][5748][gc,cpu       ] GC(27493) User=0.35s Sys=0.00s Real=0.04s
[2020-05-27T08:31:17.943+0000][5748][gc,cpu       ] GC(27494) User=0.03s Sys=0.01s Real=0.00s
[2020-05-27T08:31:18.052+0000][5748][gc,cpu       ] GC(27494) User=0.00s Sys=0.00s Real=0.00s
[2020-05-27T08:32:07.189+0000][5748][gc,cpu       ] GC(27495) User=0.47s Sys=0.00s Real=0.05s
[2020-05-27T08:33:26.162+0000][5748][gc,cpu       ] GC(27496) User=0.48s Sys=0.03s Real=0.05s
[2020-05-27T08:33:26.280+0000][5748][gc,cpu       ] GC(27497) User=0.03s Sys=0.00s Real=0.01s
[2020-05-27T08:33:26.377+0000][5748][gc,cpu       ] GC(27497) User=0.01s Sys=0.00s Real=0.00s
[2020-05-27T08:34:44.389+0000][5748][gc,cpu       ] GC(27498) User=0.51s Sys=0.03s Real=0.06s
[2020-05-27T08:36:03.023+0000][5748][gc,cpu       ] GC(27499) User=0.56s Sys=0.03s Real=0.06s
[2020-05-27T08:36:03.148+0000][5748][gc,cpu       ] GC(27500) User=0.03s Sys=0.00s Real=0.01s
[2020-05-27T08:36:03.244+0000][5748][gc,cpu       ] GC(27500) User=0.00s Sys=0.00s Real=0.00s
[2020-05-27T08:37:07.622+0000][5748][gc,cpu       ] GC(27501) User=0.63s Sys=0.00s Real=0.07s

At least the real (wall-clock) time consumed doesn't seem too high.

I also double-checked that the amount of data per document doesn't drastically increase after about the first 13M documents (after which the indexing speed decreases).

Phew, I think I finally found the root cause of the indexing slow-down: it's ICU4J transliteration, combined with the fact that after about 13.2M documents we start to have a lot of Chinese data.

I asked in a new post how to avoid duplicate transliterations (I assume that using `icu_transform` on one property with multiple fields results in running the transliteration multiple times over the same identical text): ICU transform filters slowing down indexing: how to avoid duplicate transliterations?
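For context, the mapping uses the `icu_transform` token filter from the analysis-icu plugin, configured roughly like this (filter and analyzer names here are illustrative, not my exact mapping):

```json
PUT /bulk_test
{
  "settings": {
    "analysis": {
      "filter": {
        "latin_transform": {
          "type": "icu_transform",
          "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "translit": {
          "tokenizer": "standard",
          "filter": ["latin_transform"]
        }
      }
    }
  }
}
```

With several multi-fields each running such an analyzer, the same text gets transliterated once per field, which is where the per-document cost adds up on CJK-heavy data.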


PS. @Christian_Dahlqvist feel free to continue with the good insights in the new post! :slight_smile:


Have been following this thread. Thanks for posting the updates!


That must've been a tricky one to track down! Kudos for figuring it out and thanks for letting us know the outcome.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.