Any words of wisdom for bulk indexing that starts to slow down significantly as the amount of data increases? The index I'm creating (starting from zero documents) has about 122 million documents in total, spread across 12 shards (1 replica) on three physical nodes (64 GB RAM + SSD each). Size on disk is 231 GB (without replicas), so on average one document consumes about 2 KB.
There's no significant GC going on, and in the beginning (before the index grows to about 13M documents) the indexing rate is good: about 5,600 docs/sec. Then the speed starts to drop; by 18-19M documents it is down to 435 docs/sec.
The indexing is done from a single server that round-robins bulk requests (batches of 500 documents) across the nodes in the cluster. I tried disabling the refresh interval, but it doesn't seem to have much effect.
Any tips? I'd assume Elasticsearch should be capable of bulk indexing clearly faster than 400 documents per second (as it does initially).
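In case it helps, a minimal sketch of the kind of loading loop I mean, using the Python client (the ports, the gen_docs() generator and the refresh restore value are placeholders, and the exact put_settings keyword arguments differ slightly between client versions):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# The client round-robins requests across the listed hosts.
es = Elasticsearch([
    "http://es-dev-node0:9200",
    "http://es-dev-node1:9200",
    "http://es-dev-node2:9200",
])

# Disable refresh while bulk loading; restore it afterwards.
es.indices.put_settings(index="bulk_test", body={"index": {"refresh_interval": "-1"}})

def actions(docs):
    for doc in docs:
        yield {"_index": "bulk_test", "_source": doc}

# gen_docs() is a hypothetical generator yielding the ~122M source documents.
bulk(es, actions(gen_docs()), chunk_size=500)

es.indices.put_settings(index="bulk_test", body={"index": {"refresh_interval": "1s"}})
```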
I assume you are specifying document IDs yourself before sending data to Elasticsearch instead of letting Elasticsearch assign them automatically. If that is the case, it is expected that indexing throughput drops over time as the shards grow in size, because Elasticsearch has to treat each insert as a potential update. That means a read is required for every write, which gets slower the more data you have.
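To illustrate the difference, here are the two bulk-action shapes in Python client syntax (the "my_id" field is hypothetical):

```python
def action_with_explicit_id(doc):
    # Supplying _id means every document is a potential update, so the shard
    # has to look the ID up first; this gets slower as the shard grows.
    return {"_index": "bulk_test", "_id": doc["my_id"], "_source": doc}

def action_with_auto_id(doc):
    # No _id supplied: Elasticsearch auto-generates one and can append the
    # document without the lookup, so throughput stays flat for much longer.
    return {"_index": "bulk_test", "_source": doc}
```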
bulk_test 3 p STARTED 1597660 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 3 r STARTED 1594132 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 7 r STARTED 1602340 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 7 p STARTED 1658428 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 4 r STARTED 1526284 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 4 p STARTED 1540368 2.6gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 5 p STARTED 1663810 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 5 r STARTED 1601404 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 1 r STARTED 1542631 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 1 p STARTED 1589674 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 6 p STARTED 1586528 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 6 r STARTED 1541061 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 2 r STARTED 1742290 2.8gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 2 p STARTED 1527647 2.6gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 9 p STARTED 1539265 2.6gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 9 r STARTED 1736065 2.8gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 8 r STARTED 1563906 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 8 p STARTED 1640588 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 r STARTED 1599626 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 10 p STARTED 1628775 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 11 p STARTED 1587225 2.7gb xxx.xxx.xxx.94 es-dev-node1
bulk_test 11 r STARTED 1622362 2.7gb xxx.xxx.xxx.93 es-dev-node2
bulk_test 0 p STARTED 1650475 2.7gb xxx.xxx.xxx.95 es-dev-node0
bulk_test 0 r STARTED 1557574 2.7gb xxx.xxx.xxx.93 es-dev-node2
I'm aware that dropping replicas might give some boost, but since the speed is dropping exponentially, that shouldn't be the root cause.
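(For completeness, dropping replicas for the duration of the load is just a settings toggle; a sketch with the Python client, restoring 1 replica afterwards:)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-dev-node0:9200")  # any node works for settings calls

# Drop replicas while bulk loading...
es.indices.put_settings(index="bulk_test",
                        body={"index": {"number_of_replicas": 0}})
# ... run the bulk load, then restore the replica.
es.indices.put_settings(index="bulk_test",
                        body={"index": {"number_of_replicas": 1}})
```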
Phew, I think I finally found the root cause of the indexing slowdown: ICU4J transliteration, combined with the fact that after about 13.2M documents we start to hit a lot of Chinese data.
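For anyone hitting the same thing: if the transliteration happens at index time via the analysis-icu plugin (which wraps ICU4J), the costly part is an icu_transform token filter in the analyzer. A hypothetical index definition along those lines (field, filter and analyzer names are made up; only the shard count matches my setup, and transliterating Chinese text to Latin is far more work per token than passing Latin text through):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-dev-node0:9200")

es.indices.create(
    index="bulk_test",
    body={
        "settings": {
            "number_of_shards": 12,
            "analysis": {
                "filter": {
                    # icu_transform comes from the analysis-icu plugin and runs
                    # ICU4J transliteration on every token at index time.
                    "latin_transform": {
                        "type": "icu_transform",
                        "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC",
                    }
                },
                "analyzer": {
                    "transliterated": {
                        "tokenizer": "icu_tokenizer",
                        "filter": ["latin_transform", "lowercase"],
                    }
                },
            },
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "transliterated"}
            }
        },
    },
)
```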