Shard recovery blocks updates to cluster state?

mskr · August 3, 2015, 6:26pm

Hey, everyone!
Recently we needed to make a configuration change to the machines that host our Elasticsearch cluster.

We stopped the elasticsearch service on one of our machines (node1) and changed its config. While the node was down, clients continued to index data and perform cluster state updates normally. After we restarted the elasticsearch service, client put-mapping requests started to time out. IIRC, this continued for two or three minutes.

Here is an example of a failed request:

{"error":"RemoteTransportException[[Dreadknight][inet[/192.168.156.148:9300]][indices:admin/mapping/put]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [LoggingEvent]) within 30s]; ","status":503}

At that time the pending tasks queue (http://node2:9200/_cluster/pending_tasks) looked like this. The put-mapping task at the end is the one generated by our client app.

{
  "tasks" : [ {
    "insert_order" : 3907,
    "priority" : "URGENT",
    "source" : "shard-started ([elbalogs23-07-15][0], node[m_fFX4RBTmSXvobO2rUI1Q], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[Dreadknight][HOwvbK5cS_ewV3fhm691pQ][node3][inet[/192.168.156.148:9300]]{master=true}]]",
    "executing" : true,
    "time_in_queue_millis" : 40098,
    "time_in_queue" : "40s"
  }, {
    "insert_order" : 3908,
    "priority" : "URGENT",
    "source" : "shard-started ([elbalogs29-07-15][4], node[m_fFX4RBTmSXvobO2rUI1Q], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[Slapstick][HTcJhWIYRcugOq-ea4k9og][node2][inet[/192.168.156.146:9300]]{master=true}]]",
    "executing" : false,
    "time_in_queue_millis" : 40096,
    "time_in_queue" : "40s"
  }, {
    "insert_order" : 3910,
    "priority" : "HIGH",
    "source" : "put-mapping [LoggingEvent]",
    "executing" : false,
    "time_in_queue_millis" : 19077,
    "time_in_queue" : "19s"
  }
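
To make output like this easier to read while it is happening, here is a minimal sketch that groups the pending tasks by priority and reports how long the oldest one in each group has waited. The URL is the one from this post. The function names and the trimmed `sample` data are illustrative assumptions, not part of any Elasticsearch client.

```python
import json
import urllib.request
from collections import defaultdict

# URL taken from the post; adjust host and port for your own cluster.
PENDING_TASKS_URL = "http://node2:9200/_cluster/pending_tasks"


def fetch_pending_tasks(url=PENDING_TASKS_URL, timeout=10):
    """Fetch the pending tasks document from a node (defined but not called here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_pending_tasks(payload):
    """Return {priority: {"count", "executing", "max_wait_ms"}} for the queue."""
    summary = defaultdict(lambda: {"count": 0, "executing": 0, "max_wait_ms": 0})
    for task in payload.get("tasks", []):
        entry = summary[task["priority"]]
        entry["count"] += 1
        entry["executing"] += 1 if task.get("executing") else 0
        entry["max_wait_ms"] = max(entry["max_wait_ms"], task["time_in_queue_millis"])
    return dict(summary)


# Trimmed copy of the three tasks shown above ("source" shortened for readability).
sample = {
    "tasks": [
        {"insert_order": 3907, "priority": "URGENT", "source": "shard-started",
         "executing": True, "time_in_queue_millis": 40098},
        {"insert_order": 3908, "priority": "URGENT", "source": "shard-started",
         "executing": False, "time_in_queue_millis": 40096},
        {"insert_order": 3910, "priority": "HIGH", "source": "put-mapping [LoggingEvent]",
         "executing": False, "time_in_queue_millis": 19077},
    ]
}

for priority, stats in summarize_pending_tasks(sample).items():
    print(priority, stats)
```

Against a live cluster, `summarize_pending_tasks(fetch_pending_tasks())` could be polled every few seconds to see which priority class is holding up the queue.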

I realize that we did not follow the best practices for a rolling restart described here (https://www.elastic.co/guide/en/elasticsearch/guide/current/_rolling_restarts.html); instead, we accidentally simulated an ungraceful node restart. Still, it seems strange that a single node failure effectively blocked updates to the cluster state.
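
For reference, the procedure in the linked guide amounts to disabling shard allocation before stopping the node and re-enabling it once the node has rejoined. Below is a minimal sketch of those two settings calls, assuming the cluster settings endpoint is reachable on node2. The helper names are hypothetical, and the sketch only handles the two values the guide uses. I have not verified whether this would have avoided the timeouts described above.

```python
import json
import urllib.request

# Host taken from the post; assumes the HTTP API is reachable there.
CLUSTER_SETTINGS_URL = "http://node2:9200/_cluster/settings"


def allocation_settings_body(mode):
    """Build the transient settings body that switches shard allocation on or off."""
    # Only the two values used in the rolling restart guide are accepted here.
    if mode not in ("all", "none"):
        raise ValueError("mode must be 'all' or 'none'")
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def put_allocation(mode, url=CLUSTER_SETTINGS_URL, timeout=10):
    """Send the settings change to the cluster (defined but not called here)."""
    body = json.dumps(allocation_settings_body(mode)).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Intended order of operations for one node:
#   put_allocation("none")   # 1. stop the cluster from reallocating shards
#   ... stop the service, change the config, start the service ...
#   put_allocation("all")    # 2. allow allocation again once the node has rejoined
print(json.dumps(allocation_settings_body("none")))
```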

Could anyone help me understand the behavior we encountered?
What are shard-started tasks and what causes them to remain in the queue for a long time?
Is this the intended behavior?
Is there a way to mitigate this (an unexpected shutdown then restart)?

Thanks in advance:)
