Dear Community,
For a few days now I have had unassigned shards due to allocation failures, so the cluster goes yellow and sometimes even red when the Elasticsearch service goes down.
I have the following architecture:
1 master node
2 data nodes (the master is one of them)
2 client nodes
4 Elasticsearch nodes in total.
1.7 k indices
1.9 b documents
2.9 TB of data
The indices on the data nodes are configured with 5 primary shards + 1 complete replica each.
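For reference, I believe those per-index settings can be confirmed with something like this (host and index name are only examples, taken from the allocation explain output below):

curl -X GET "http://elastic-xx.domain.local:9200/winlogbeat-2018.08.20/_settings/index.number_of_*?pretty"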
[root@elastic-xx ~]# curl -X GET http://elastic-xx.domain.local:9200/_cluster/allocation/explain?pretty
{
"index" : "winlogbeat-2018.08.20",
"shard" : 2,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2018-08-20T18:59:53.330Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "QrlS-a6EThqug57OnWPdmg",
"node_name" : "elastic-xx",
"transport_address" : "xxx.xxx.xxx.xxx:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-08-20T18:59:53.330Z], failed_attempts[5], delayed=false, details[failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{10.10.68.24}{10.10.68.24:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ], allocation_status[no_attempt]]]"
},
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[winlogbeat-2018.08.20][2], node[QrlS-a6EThqug57OnWPdmg], [P], s[STARTED], a[id=EFCtso3bSP2dDU9VbbDfOw]]"
}
]
},
{
"node_id" : "uq3vVKvkRPKC94OOgfyrCA",
"node_name" : "elastic-xx",
"transport_address" : "xxx.xxx.xxx.xxx:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-08-20T18:59:53.330Z], failed_attempts[5], delayed=false, details[failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-01][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ], allocation_status[no_attempt]]]"
}
]
}
]
}
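The explanation above says to manually call /_cluster/reroute?retry_failed=true once the maximum number of failed allocation attempts is exceeded. If I understand correctly, that would be something like (host name as an example):

curl -X POST "http://elastic-xx.domain.local:9200/_cluster/reroute?retry_failed=true&pretty"

but I suppose this only helps once the underlying "Too many open files" problem is solved.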
All the nodes in the cluster have max_file_descriptors set to 65536:
{
"nodes": {
"DIyPbW4WQoSHFrtAYc0gmA": {
"process": {
"max_file_descriptors": 65536
}
},
"QrlS-a6EThqug57OnWPdmg": {
"process": {
"max_file_descriptors": 65536
}
},
"uq3vVKvkRPKC94OOgfyrCA": {
"process": {
"max_file_descriptors": 65536
}
},
"zxSDt-cRTeOIQwNWzJgZWA": {
"process": {
"max_file_descriptors": 65536
}
}
}
}
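(For completeness, I gathered that output roughly like this, filtering the node stats down to the file descriptor limit:)

curl -X GET "http://elastic-xx.domain.local:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty"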
I thought about adding a data node, but I do not want more replicas or to consume much more storage.
Do you have any recommendations on how to optimise the cluster and fix this issue?
Thanks a lot for your help.
Best Regards, Edouard Fazenda.