Very slow resync after node stop/start

Hello,

I have 5 Elasticsearch nodes in one cluster (about 85,000,000 docs, 30 GB of data). After stopping and starting one server, I saw that the resync started very slowly and put very high load on every node in the cluster. Nodes randomly drop out of the cluster and the resync starts over again. What can I do to fix this?

{
"cluster_name" : "Prod",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 4,
"number_of_data_nodes" : 4,
"active_primary_shards" : 1863,
"active_shards" : 3294,
"relocating_shards" : 0,
"initializing_shards" : 20,
"unassigned_shards" : 6201,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 818,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 4071620,
"active_shards_percent_as_number" : 34.619022595901214
}

elasticsearch:
  build: ./elastic
  container_name: elastic
  command: elasticsearch -Des.network.host=0.0.0.0
  net: host
  ports:
    - "9200:9200"
    - "9300:9300"
  volumes:
    - "/srv/docker/elastic/etc:/usr/share/elasticsearch/config"
    - "/srv/docker/elastic/db:/usr/share/elasticsearch/data"
    - "/srv/backup/elasticsearch:/backup"
  restart: always

cluster.name: Prod
node.name: "node04"
http.port: 9200
network.host: non_loopback
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node02", "node03", "node04", "node05", "node06"]
transport.publish_host: 0.0.0.0
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: X-Requested-With, X-Auth-Token, Content-Type, Content-Length

You have 9,495 shards on 5 nodes?
That's about 2,000 shards per node.

Would you run 2,000 database instances on a single physical machine?

That's a lot.

You did not mention your version, BTW. If you have not upgraded, upgrade.
Reduce the number of shards. Maybe you have 5 shards per index and 1 replica? If you don't need that many, reduce it.
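
For reference, a minimal sketch of both changes, assuming a node reachable on localhost:9200 and a hypothetical index named my_index (the primary shard count can only be set at index creation, while the replica count can be changed at any time):

# Create a new index with 1 primary shard and 1 replica (my_index is hypothetical):
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
}'

# Lower the replica count on an existing index:
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'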

You have "number_of_pending_tasks" : 818 here, so I think you have to wait until the cluster recovers.

Also look at your logs. They will probably tell you what happened.
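
To keep an eye on the recovery while you wait, a few read-only calls help (assuming a node reachable on localhost:9200):

# What the master is still queueing up:
curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'
# Per-shard recovery progress:
curl -XGET 'http://localhost:9200/_cat/recovery?v'
# Where each shard currently lives and in which state:
curl -XGET 'http://localhost:9200/_cat/shards?v'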

Hello, thank you for the response.

My Dockerfile is:

FROM elasticsearch:latest
RUN if [ ! -d /usr/share/elasticsearch/plugins/hq ]; then /usr/share/elasticsearch/bin/plugin install royrusso/elasticsearch-HQ; fi
RUN if [ ! -d /usr/share/elasticsearch/plugins/kopf ]; then /usr/share/elasticsearch/bin/plugin install lmenezes/elasticsearch-kopf/2.0; fi

My replica count is 4. We need all of the indexes on all of the servers.
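
In case it matters: the replica count is a dynamic per-index setting, so it can be changed at any time without reindexing. A sketch of how it is applied across all indices, assuming a node on localhost:9200:

# Set (or later lower) the number of replicas on every index at once:
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index": { "number_of_replicas": 4 }
}'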

But why so many primary shards then?

9,495 / 5 = 1,899 primary shards. Maybe one shard per index: 1,899 indices...

85,000,000 / 1,899 ≈ 45,000 docs per primary shard.
30 GB / 1,899 ≈ 16 MB per primary shard.

You can probably go up to 20 GB per shard, so that's a lot of waste IMO.
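
To check those averages against the real distribution, the cat APIs report per-index and per-shard sizes (assuming a node on localhost:9200):

# Shard count, doc count and size per index:
curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'
# Size of every individual shard:
curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store'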

I solved my problem by doubling the heap size on each node (from 1g to 2g); the cluster no longer got stuck during resync and finished it without problems. Thank you.
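
For anyone hitting the same issue: with the official 2.x-era image used above, the heap is controlled through the ES_HEAP_SIZE environment variable, so the change is one environment entry in the compose file or on the command line. A sketch only; adjust the image and paths to your own setup:

# Sketch: pass the larger heap into the container via ES_HEAP_SIZE.
docker run -d --name elastic -e ES_HEAP_SIZE=2g \
  -v /srv/docker/elastic/etc:/usr/share/elasticsearch/config \
  -v /srv/docker/elastic/db:/usr/share/elasticsearch/data \
  elasticsearch:latest elasticsearch -Des.network.host=0.0.0.0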

Great, but still: reduce the number of shards.

This time I have:
{
"cluster_name" : "Prod",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 5,
"active_primary_shards" : 1918,
"active_shards" : 9515,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

with 30 GB of data, 85,000,000 docs and 630 indices. Is that not optimal? The indexes follow a mask like log-<date>.

Those are logs?

Why do you need 4 replicas then?

Considering that you only have 30 GB of data, you have far too many indices and shards. With this amount of data, each daily index should only have a single primary shard, and I would most likely recommend switching to e.g. monthly indices in order to increase the average shard size and thereby reduce the number of indices/shards that need to be managed and the overhead associated with them. If you are on Elasticsearch 5.x, you can use the shrink index API to get down to 1 primary shard per index. Even though it is getting a bit old, this blog post also contains some good points.
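
A sketch of both suggestions, assuming the log-<date> naming above, a node reachable on localhost:9200, and node04 as an arbitrary node to shrink onto; the index name log-2016.09.01 is illustrative, and the shrink API exists in 5.x only:

# New indices matching log-* get a single primary shard via a template:
curl -XPUT 'http://localhost:9200/_template/log_single_shard' -d '{
  "template": "log-*",
  "settings": { "number_of_shards": 1 }
}'

# Shrink an existing index (5.x): make it read-only and move a copy of every shard to one node...
curl -XPUT 'http://localhost:9200/log-2016.09.01/_settings' -d '{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "node04"
}'
# ...then shrink it into a new single-shard index:
curl -XPOST 'http://localhost:9200/log-2016.09.01/_shrink/log-2016.09.01-shrunk' -d '{
  "settings": { "index.number_of_shards": 1 }
}'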
