In our production cluster on AWS we have 8 i3.xlarge ingest+data nodes and 3 m3.medium master nodes. Indices have
translog.durability=async, 6 shards and 2 replicas.
We have a service called persister that takes data from Redis and persists it to Elasticsearch in batch requests. Each persister run takes around 2-3 minutes and indexes 150k items in bulk requests of 2k items each. After the indexing phase it refreshes the affected indices one by one; usually about 6 indices are affected.
What I'm not happy about is that the refresh phase usually takes around 80% of the run time, and for the largest indices the refresh usually times out after 60s.
What affects the refresh time the most, and how could we improve it? Does our index/refresh flow make sense, or are there other best practices that we should follow?
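To make the flow concrete, here is a minimal Python sketch of the request sequence one persister run issues, as I understand it from the description above. It only plans the calls and sends nothing. The function names and the "_index" key on each item are hypothetical, chosen for illustration; /_bulk and /{index}/_refresh are the standard Elasticsearch endpoints.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

BATCH_SIZE = 2_000  # items per bulk request, as stated in the question


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_persister_run(
    items: Sequence[dict], batch_size: int = BATCH_SIZE
) -> List[Tuple[str, str]]:
    """Return the (method, path) sequence one persister run would issue.

    Assumption: each item names its target index under the hypothetical
    key "_index".
    """
    requests: List[Tuple[str, str]] = []
    touched: Dict[str, None] = {}  # affected indices, in first-seen order

    # Index phase: one bulk request per batch of items.
    for batch in chunked(items, batch_size):
        requests.append(("POST", "/_bulk"))
        for item in batch:
            touched.setdefault(item["_index"], None)

    # Refresh phase: one explicit refresh per affected index, one by one.
    for index in touched:
        requests.append(("POST", f"/{index}/_refresh"))
    return requests
```

With 150k items spread over 6 indices this plans 75 bulk requests followed by 6 refresh requests, and it is those 6 refresh requests that account for most of the run time.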
More things that could be important:
Besides the persister, there is one other service indexing data into Elasticsearch. It runs as 10 instances, and every 60s each instance indexes around 10k items into the same indices as the persister. It doesn't force any refresh.
Data is not distributed equally across shards (e.g. 23.8gb, 45.5gb, 30.6gb, 13.9gb, 12.1gb, 21.1gb). This is probably because we use routing for all items and some routing groups have much more data than others.
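To put a number on that imbalance, here is a small Python sketch that summarizes the six shard sizes listed above. The function name and the choice of metrics are mine, purely for illustration.

```python
from typing import Dict, List

# Shard sizes in GB, copied from the example above.
SHARD_SIZES_GB: List[float] = [23.8, 45.5, 30.6, 13.9, 12.1, 21.1]


def skew_summary(sizes_gb: List[float]) -> Dict[str, float]:
    """Summarize how unevenly data is spread across the shards of one index."""
    total = sum(sizes_gb)
    mean = total / len(sizes_gb)
    return {
        "total_gb": round(total, 1),
        "mean_gb": round(mean, 1),
        # How much larger the biggest shard is than an evenly sized one.
        "max_over_mean": round(max(sizes_gb) / mean, 2),
        # Spread between the biggest and the smallest shard.
        "max_over_min": round(max(sizes_gb) / min(sizes_gb), 2),
    }


if __name__ == "__main__":
    print(skew_summary(SHARD_SIZES_GB))
```

For these sizes the largest shard is roughly 1.9 times the mean and about 3.8 times the smallest one.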
We create new indices every week and index new data into them.
index          pri rep docs.count store.size pri.store.size    tm  pri.tm
xxx_2018-10-15   5   2  514448277    691.7gb        230.5gb 3.5gb   1.1gb
xxx_2018-10-29   6   2  460991107    610.4gb        203.4gb 3.2gb     1gb
xxx_2018-10-08   5   1  547700252    490.3gb        245.1gb 2.7gb   1.3gb
yyy_2018-10-29   6   1  332578930    301.8gb        150.9gb 2.2gb     1gb
xxx_2018-11-05   6   2  316782562    424.2gb        144.4gb 2.1gb 725.6mb
xxx_2018-10-22   6   1  333790508    298.2gb        149.1gb 1.6gb 826.2mb
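In case it helps to read the listing, here is a rough Python sketch that derives the average primary shard size per index (pri.store.size divided by pri). The parsing helper is hypothetical and only handles the "gb" and "mb" suffixes that appear above, assuming 1gb = 1024mb.

```python
from typing import Dict

# Data rows copied from the listing above (header omitted).
CAT_ROWS = """\
xxx_2018-10-15 5 2 514448277 691.7gb 230.5gb 3.5gb 1.1gb
xxx_2018-10-29 6 2 460991107 610.4gb 203.4gb 3.2gb 1gb
xxx_2018-10-08 5 1 547700252 490.3gb 245.1gb 2.7gb 1.3gb
yyy_2018-10-29 6 1 332578930 301.8gb 150.9gb 2.2gb 1gb
xxx_2018-11-05 6 2 316782562 424.2gb 144.4gb 2.1gb 725.6mb
xxx_2018-10-22 6 1 333790508 298.2gb 149.1gb 1.6gb 826.2mb
"""


def to_gb(text: str) -> float:
    """Convert a size such as '230.5gb' or '725.6mb' to gigabytes."""
    value, unit = float(text[:-2]), text[-2:]
    if unit == "gb":
        return value
    if unit == "mb":
        return value / 1024
    raise ValueError(f"unsupported unit in {text!r}")


def avg_primary_shard_gb(rows: str) -> Dict[str, float]:
    """Map each index name to its average primary shard size in GB."""
    result: Dict[str, float] = {}
    for line in rows.splitlines():
        # Columns: index, pri, rep, docs.count, store.size, pri.store.size, ...
        index, pri, _rep, _docs, _store, pri_store = line.split()[:6]
        result[index] = round(to_gb(pri_store) / int(pri), 1)
    return result


if __name__ == "__main__":
    for name, size in avg_primary_shard_gb(CAT_ROWS).items():
        print(f"{name}: {size} GB per primary shard")
```

These are averages only; because of the routing, individual shards deviate a lot from them.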
We have one nested property and we use parent/child relations between our 4 types.
We use Elasticsearch 5.5 and have 4 types. The largest type has 45 properties.