Slow index creation and indexing when many indices must be created at once

(Michael) #1


I'm using daily indices and index templates in my Elasticsearch environment. The time is based on a field in the document.

Now I have a problem with bulk indexing.
For example, if I import one year of data, 365 indices must be created in a short time.


This results in poor performance for both index creation and indexing, even if there is only 1 document per day.

My cluster: 3 ES Nodes, v5.4.0
Index settings: 2 shards, 0 replicas (during bulk index)

I have watched the creation process on shard level and found out:

  • A maximum of 12 shards are INITIALIZING at the same time (is there a configurable limit?)
  • All other shards stay UNASSIGNED and are processed one after another

name-2014-03-10 1 p INITIALIZING 0 130b node1
name-2014-03-10 0 p INITIALIZING 0 130b node2
... only up to 12
name-2014-01-02 1 p STARTED 144 40.2kb node2
name-2014-01-02 0 p STARTED 144 35.1kb node3
name-2014-02-01 1 p STARTED 146 19.3kb node1
name-2014-02-01 0 p STARTED 142 17.7kb node2
name-2014-02-05 1 p STARTED 147 20.4kb node1
name-2014-02-05 0 p STARTED 141 18kb node2
name-2014-02-20 1 p STARTED 143 17.6kb node2
name-2014-02-20 0 p STARTED 145 26.6kb node3
... started shards
name-2014-03-20 1 p UNASSIGNED
name-2014-03-20 1 p UNASSIGNED
name-2014-03-21 1 p UNASSIGNED
name-2014-03-21 1 p UNASSIGNED
name-2014-03-22 1 p UNASSIGNED
name-2014-03-22 1 p UNASSIGNED
... other shards waiting

So it took a while until all shards were started. After that, the indexing process is fast again.
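
To follow how many shards sit in each state while the import runs, the listing above can be tallied with a few lines of Python. This is only a sketch: it assumes whitespace-separated text with the state in the fourth column, as in the listing, and `count_shard_states` is a hypothetical helper name, not part of any Elasticsearch client.

```python
from collections import Counter

# A few lines copied from the listing above: index, shard, prirep, state, ...
SAMPLE = """\
name-2014-03-10 1 p INITIALIZING 0 130b node1
name-2014-03-10 0 p INITIALIZING 0 130b node2
name-2014-01-02 1 p STARTED 144 40.2kb node2
name-2014-01-02 0 p STARTED 144 35.1kb node3
name-2014-03-20 1 p UNASSIGNED
name-2014-03-21 1 p UNASSIGNED
"""

# Shard states that can appear in the state column.
KNOWN_STATES = {"INITIALIZING", "RELOCATING", "STARTED", "UNASSIGNED"}


def count_shard_states(cat_shards_text):
    """Count shards per state in whitespace-separated shard listing text."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Skip blank lines, headers and comments such as "... started shards".
        if len(fields) < 4 or fields[3] not in KNOWN_STATES:
            continue
        counts[fields[3]] += 1
    return counts


if __name__ == "__main__":
    for state, number in sorted(count_shard_states(SAMPLE).items()):
        print(state, number)
```

Run against the full output at intervals, this shows the UNASSIGNED backlog draining while the INITIALIZING count stays capped.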

Is there a way to optimize this?

(Christian Dahlqvist) #2

It looks like your indices and shards are very small. Consider changing to using indices covering a longer time period, e.g. monthly or even yearly indices. Aim to get the average shard size up to at least a few GB in size.
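
As a rough illustration of how much the index count drops with a coarser time period, here is a minimal Python sketch. The `name-` prefix is taken from the shard listing in the first post; the helper names and the naming patterns for monthly and yearly indices are assumptions made for the example.

```python
from datetime import date, timedelta

# strftime patterns per granularity; the daily pattern matches "name-2014-03-10".
PATTERNS = {"daily": "%Y-%m-%d", "monthly": "%Y-%m", "yearly": "%Y"}


def index_name(prefix, day, granularity):
    """Build the target index name for a document dated `day`."""
    return f"{prefix}-{day.strftime(PATTERNS[granularity])}"


def indices_for_range(prefix, start, end, granularity):
    """Return the set of index names needed for every day from start to end."""
    names = set()
    day = start
    while day <= end:
        names.add(index_name(prefix, day, granularity))
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    start, end = date(2014, 1, 1), date(2014, 12, 31)
    for granularity in PATTERNS:
        count = len(indices_for_range("name", start, end, granularity))
        print(granularity, count)
```

For the year 2014 this gives 365 daily indices, 12 monthly indices, or a single yearly index, so the number of indices that have to be created during the import shrinks accordingly.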

(Michael) #3

Thanks for your answer, but the indices are a few GB in size each (the listing above was taken at the beginning of the bulk process).

But that has nothing to do with the problem of creating many indices at once.

(Aaron Mildenstein) #4

It actually does, but perhaps not in the way you might think.

Each shard has a management cost in Elasticsearch. At a certain point, a significant portion of your heap will be locked up, just for keeping tabs on your shards. When that happens, everything slows down. Indexing takes a hit. Any operation that requires a read from or write to the cluster metadata will be slowed down, including adding more indices.

There are no hard rules here, but with a 30G heap, I would not want more than 900 - 1200 shards per node. 600 is a pretty safe bet. That safe number falls off pretty dramatically if your heap sizes are smaller than that.
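
To put numbers on that for the setup described above (3 nodes, 2 primary shards per index), here is a back-of-the-envelope sketch in Python. It assumes shards are spread evenly across nodes, and the three-year, one-replica case is a hypothetical scenario, not something stated in this thread.

```python
def shards_per_node(indices, primaries, replicas, nodes):
    """Average shards per node, counting primaries and replicas alike."""
    total_shards = indices * primaries * (1 + replicas)
    return total_shards / nodes


if __name__ == "__main__":
    # One year of daily indices, 2 primaries, 0 replicas, 3 nodes.
    print(round(shards_per_node(365, 2, 0, 3)))
    # Hypothetical: three years of daily indices with 1 replica enabled.
    print(round(shards_per_node(3 * 365, 2, 1, 3)))
    # Same three years stored as monthly indices with 1 replica.
    print(round(shards_per_node(3 * 12, 2, 1, 3)))
```

One year of daily indices without replicas works out to roughly 243 shards per node. Three years with one replica comes to 1460 per node, which is past the 900 - 1200 range, while the monthly layout stays at 48.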

You can complain that 12 shards initializing at once is limiting you, but the recommendation to reduce shard count will save you a lot of headache down the road.
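
On the question of a configurable limit: one plausible explanation for the ceiling of 12, offered as an assumption and not confirmed anywhere in this thread, is the cluster setting `cluster.routing.allocation.node_initial_primaries_recoveries`, which defaults to 4 per node; with 3 nodes that would give 12. The Python sketch below only does that arithmetic and builds a request body for the cluster settings API. The helper names are hypothetical, and raising the value treats the symptom, not the shard count.

```python
import json

# Assumption: this setting is what caps concurrently initializing primaries.
SETTING = "cluster.routing.allocation.node_initial_primaries_recoveries"
DEFAULT_PER_NODE = 4


def max_initializing_primaries(nodes, per_node=DEFAULT_PER_NODE):
    """Cluster-wide ceiling on primaries initializing at the same time."""
    return nodes * per_node


def cluster_settings_body(per_node):
    """JSON body for PUT _cluster/settings; transient, so not kept after a full restart."""
    return json.dumps({"transient": {SETTING: per_node}})


if __name__ == "__main__":
    print(max_initializing_primaries(3))
    print(cluster_settings_body(8))
```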

(Michael) #5

Thank you for the explanation.

Is your recommendation about the number of primary shards or the total number of shards?

(Aaron Mildenstein) #6

Total number of shards.
