Cluster status turned red after generating about 2TB of index data

Elasticsearch / Logstash / Kibana OSS 6.2.4
We ingest about 23GB of logs per day. Initially the cluster ingested data fine, although building large indices was slow. After ingesting about 2TB of index data it slowed down further, and eventually the cluster status turned red.
How can the cluster recover from this, and how can I prevent it from happening again?

Thanks.

{"log":"[2018-10-09T22:13:30,824][WARN ][o.e.c.s.MasterService ] [7BNppEw] cluster state update task [shard-started shard id [[logstash-mail-2018.09.17][1]], allocation id [Jo_7mNnMSWKiLzrKB3FLjA], primary term [0], message [after existing recovery][shard id [[logstash-mail-2018.09.17][1]], allocation id [Jo_7mNnMSWKiLzrKB3FLjA], primary term [0], message [after existing recovery]], shard-started shard id [[logstash-cron-2018.09.17][2]], allocation id [QSpAD5I2SS-KrUdxudGeig], primary term [0], message [after existing recovery][shard id [[logstash-cron-2018.09.17][2]], allocation id [QSpAD5I2SS-KrUdxudGeig], primary term [0], message [after existing recovery]], shard-started shard id [[logstash-apache-error-2018.09.17][1]], allocation id [647TJjG4QzeLOmtOtFAHMA], primary term [0], message [after existing recovery][shard id [[logstash-apache-error-2018.09.17][1]], allocation id [647TJjG4QzeLOmtOtFAHMA], primary term [0], message [after existing recovery]]] took [39.3s] above the warn threshold of 30s\n","stream":"stdout","time":"2018-10-09T22:13:30.825547563Z"}
{"log":"[2018-10-09T22:13:30,824][WARN ][o.e.c.s.ClusterApplierService] [7BNppEw] cluster state applier task [apply cluster state (from master [master {7BNppEw}{7BNppEwTRNuuSREQEKewnA}{GToVKQqrTsGVoMY0GWivNQ}{192.168.144.2}{192.168.144.2:9300} committed version [98] source [shard-started shard id [[logstash-mail-2018.09.17][1]], allocation id [Jo_7mNnMSWKiLzrKB3FLjA], primary term [0], message [after existing recovery][shard id [[logstash-mail-2018.09.17][1]], allocation id [Jo_7mNnMSWKiLzrKB3FLjA], primary term [0], message [after existing recovery]], shard-started shard id [[logstash-cron-2018.09.17][2]], allocation id [QSpAD5I2SS-KrUdxudGeig], primary term [0], message [after existing recovery][shard id [[logstash-cron-2018.09.17][2]], allocation id [QSpAD5I2SS-KrUdxudGeig], primary term [0], message [after existing recovery]], shard-started shard id [[logstash-apache-error-2018.09.17][1]], allocation id [647TJjG4QzeLOmtOtFAHMA], primary term [0], message [after existing recovery][shard id [[logstash-apache-error-2018.09.17][1]], allocation id [647TJjG4QzeLOmtOtFAHMA], primary term [0], message [after existing recovery]]]])] took [39.1s] above the warn threshold of 30s\n","stream":"stdout","time":"2018-10-09T22:13:30.826204607Z"}

Hi Netvmdb,

These two warnings aren't related to the red status; they show cluster state updates exceeding the 30-second warn threshold, which is typically caused by GC pauses taking longer than 30 seconds.
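
If you want to confirm that, the node JVM stats include GC collection counts and times; something like this should do (adjust the host to one of your nodes):

curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'

Frequent or long old-generation collections in that output would line up with these warnings.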

To determine why the cluster is red, I'd recommend the Cluster Allocation Explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/6.4/cluster-allocation-explain.html

When you run it, it returns a detailed explanation of why an unassigned shard cannot be assigned.
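
For example, you can list the unassigned shards and then ask about a specific one (the index and shard below are just placeholders taken from your warnings; without a request body the API picks an arbitrary unassigned shard, which is usually enough to get started):

curl -XGET 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "logstash-mail-2018.09.17",
  "shard": 1,
  "primary": true
}'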

I ran the Allocation Explain API twice; here are the results. How can I resolve the "stale or corrupt" shard and the "primary shard existed but can no longer be found" cases?

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'


{
  "index" : "logstash-mail-2018.09.22",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2018-10-10T15:45:21.335Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "T5v8cMr0QB2lUXPdZtFj4A",
      "node_name" : "T5v8cMr",
      "transport_address" : "192.168.160.5:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "u7qoMkYzR1Wpieh8PWZWmw",
      "node_name" : "u7qoMkY",
      "transport_address" : "192.168.160.8:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "1vFOeAzoTdmcMnwPIY9uZQ"
      }
    }
  ]
}

{
  "index" : "logstash-message-2018.09.01",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2018-10-10T15:45:21.323Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "T5v8cMr0QB2lUXPdZtFj4A",
      "node_name" : "T5v8cMr",
      "transport_address" : "192.168.160.5:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "u7qoMkYzR1Wpieh8PWZWmw",
      "node_name" : "u7qoMkY",
      "transport_address" : "192.168.160.8:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
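
If there is no snapshot to restore these indices from, the usual last resort for "no_valid_shard_copy" is the cluster reroute API with one of its allocate commands. This is only a sketch based on the two outputs above, and both commands accept data loss: allocate_stale_primary promotes the out-of-sync copy that still exists on u7qoMkY (any writes it missed are gone), while allocate_empty_primary creates a brand-new empty shard because no copy was found anywhere (everything in that shard is lost):

curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "logstash-mail-2018.09.22",
        "shard": 1,
        "node": "u7qoMkY",
        "accept_data_loss": true
      }
    },
    {
      "allocate_empty_primary": {
        "index": "logstash-message-2018.09.01",
        "shard": 3,
        "node": "T5v8cMr",
        "accept_data_loss": true
      }
    }
  ]
}'

Restoring from a snapshot, if one exists, is the safer option.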
