Hi,
I'm trying to start my ES cluster, but as soon as it reaches 80% assigned shards it drops back to around 50%, and it keeps doing that indefinitely. It also takes hours to reach 80% of assigned shards.
Since the logs told us that we had too many open files, we increased the max file descriptor limit, but this didn't solve the issue. What action do you recommend? We would like to save as much information as possible.
Thank you.
The cluster has no replicas and 6 shards per index, for a total of 16672 shards.
The logs show the following errors:
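To put the shard count in perspective, here is a rough back-of-the-envelope sketch. It assumes the shards are spread evenly over the 3 data nodes, which may not match the real allocation, and it uses the 65535 descriptor limit reported in the node stats. The variable names are only illustrative.

```python
# Figures taken from this post; the even spread across nodes is an assumption.
total_shards = 16672
data_nodes = 3
max_file_descriptors = 65535

# Average number of shards each data node would have to open.
shards_per_node = total_shards / data_nodes

# Descriptor budget left for each shard if every shard on the node is open.
descriptors_per_shard = max_file_descriptors / shards_per_node

print(f"shards per data node: {shards_per_node:.0f}")
print(f"descriptor budget per shard: {descriptors_per_shard:.1f}")
```

Under that assumption each shard gets only a small number of descriptors, which would be consistent with the "Too many open files" errors in the logs.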
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([akainix-smg-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[4], s[STARTED], a[id=f_aQYmUEQtOQTBfPCY8ncg]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([itau-bluecoat-2017.06.14][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=haNxy7spSdyQrzagSY6Hqw]), message [master {elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true} marked shard as started, but shard has previous failed. resending shard failure.]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([euroamerica-sep-2017.06.17][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[16], s[STARTED], a[id=D7iq4f8dRpeshb5vtqax7Q]), message [master [{elastic-1}{jXLTdv8ISEmEEHMUJWSnWQ}{172.16.87.59}{172.16.87.59:9300}{master=true}] marked shard as started, but shard has not been created, mark shard as failed]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=vM4Mwu_0RYGfnVzsz4ZzPQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.03.04][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[3], s[INITIALIZING], a[id=fnnETNxfRA2lB3vvY0_F2w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2017-06-21T14:30:54.236Z]]), message [failed recovery]]
[2017-06-21 14:40:52,778][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([anonimo-monitoreo-2017.06.18][0], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[19], s[INITIALIZING], a[id=if-xP-xSSLq3ZytLcYAVeQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/anonimo-monitoreo-2017.06.18/0/_state: Too many open files]; ]]), message [failed recovery]]
[2017-06-21 14:40:52,779][ERROR][cluster.action.shard ] [elastic-1] unexpected failure during [shard-failed ([itau-monitoreo-2017.03.09][3], node[jXLTdv8ISEmEEHMUJWSnWQ], [P], v[15], s[INITIALIZING], a[id=Jtzidiw2TmONLM3z8-wneg], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-06-21T16:36:18.410Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: FileSystemException[/datos/elasticsearch/reportes/nodes/0/indices/itau-monitoreo-2017.03.09/3/_state: Too many open files]; ]]), message [failed recovery]]
My ES node stats (process section) are:
{
"cluster_name": "reportes",
"nodes": {
"WE4NoGaPRXW8qMiXTo8iAg": {
"timestamp": 1498074992865,
"name": "elastic-3",
"transport_address": "172.16.87.60:9300",
"host": "172.16.87.60",
"ip": [
"172.16.87.60:9300",
"NONE"
],
"attributes": {
"master": "true"
},
"process": {
"timestamp": 1498074992865,
"open_file_descriptors": 62974,
"max_file_descriptors": 65535,
"cpu": {
"percent": 22,
"total_in_millis": 30145070
},
"mem": {
"total_virtual_in_bytes": 131429810176
}
}
},
"uu1xA_1qTsaChCfyCm11Ig": {
"timestamp": 1498074994107,
"name": "elastic-gui",
"transport_address": "172.16.87.64:9301",
"host": "172.16.87.64",
"ip": [
"172.16.87.64:9301",
"NONE"
],
"attributes": {
"data": "false",
"master": "false"
},
"process": {
"timestamp": 1498074994107,
"open_file_descriptors": 380,
"max_file_descriptors": 65535,
"cpu": {
"percent": 4,
"total_in_millis": 6583920
},
"mem": {
"total_virtual_in_bytes": 10332991488
}
}
},
"pSWuGFrOQPCvX4ykrzcV0Q": {
"timestamp": 1498074992869,
"name": "elastic-2",
"transport_address": "172.16.87.58:9300",
"host": "172.16.87.58",
"ip": [
"172.16.87.58:9300",
"NONE"
],
"attributes": {
"master": "true"
},
"process": {
"timestamp": 1498074992869,
"open_file_descriptors": 65425,
"max_file_descriptors": 65535,
"cpu": {
"percent": 27,
"total_in_millis": 41594260
},
"mem": {
"total_virtual_in_bytes": 135810736128
}
}
},
"jXLTdv8ISEmEEHMUJWSnWQ": {
"timestamp": 1498074992869,
"name": "elastic-1",
"transport_address": "172.16.87.59:9300",
"host": "172.16.87.59",
"ip": [
"172.16.87.59:9300",
"NONE"
],
"attributes": {
"master": "true"
},
"process": {
"timestamp": 1498074992869,
"open_file_descriptors": 47658,
"max_file_descriptors": 65535,
"cpu": {
"percent": 22,
"total_in_millis": 40894600
},
"mem": {
"total_virtual_in_bytes": 112791838720
}
}
}
}
}
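As a quick reading aid for the stats above, this small sketch computes how close each node is to its descriptor limit. The values are copied by hand from the output; the helper name `usage_percent` is just for illustration.

```python
# (open, max) file descriptor values copied from the node stats above.
nodes = {
    "elastic-1": (47658, 65535),
    "elastic-2": (65425, 65535),
    "elastic-3": (62974, 65535),
    "elastic-gui": (380, 65535),
}


def usage_percent(open_fds, max_fds):
    """Return the share of the descriptor limit currently in use, in percent."""
    return 100.0 * open_fds / max_fds


for name, (open_fds, max_fds) in sorted(nodes.items()):
    free = max_fds - open_fds
    print(f"{name}: {usage_percent(open_fds, max_fds):.1f}% used, {free} free")
```

With these numbers elastic-2 has only 110 descriptors left, so at least one data node is effectively at the limit even after raising it.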
My ES cluster health:
{
"cluster_name": "reportes",
"status": "red",
"timed_out": false,
"number_of_nodes": 4,
"number_of_data_nodes": 3,
"active_primary_shards": 8334,
"active_shards": 8334,
"relocating_shards": 0,
"initializing_shards": 12,
"unassigned_shards": 8326,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 291067,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 341159,
"active_shards_percent_as_number": 49.9880038387716
}
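For reference, the percentage in the health output can be cross-checked against the shard counts. This is a minimal sketch that assumes the percentage is simply active shards divided by all shards (active, initializing, and unassigned); the numbers are copied from the output above.

```python
# Shard counts copied from the cluster health output above.
active_shards = 8334
initializing_shards = 12
unassigned_shards = 8326

# Assumption: the total is the sum of these three states (no relocating shards).
total_shards = active_shards + initializing_shards + unassigned_shards

active_percent = 100.0 * active_shards / total_shards

print(f"total shards: {total_shards}")
print(f"active percent: {active_percent:.4f}")
```

The total matches the 16672 shards mentioned earlier, and the result agrees with `active_shards_percent_as_number`, so roughly half of the primaries are still unassigned at this point.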