Elasticsearch cluster status RED and not responding


(peshu) #1

Hi friends,

My Elasticsearch cluster is not responding, and its status has been RED for the last few days. It is not able to allocate the unassigned shards. The issue started when I first restarted the primary instance, and I am also not able to delete the unassigned shards. Can you please help?

Nodes in the cluster:
172.16.112.112 172.16.112.112 62 69 0.27 d m elastic-search-60586867-3-183395336.major-qa.graylog.dfwqa2.qa.com
172.16.118.13 172.16.118.13 46 87 0.00 c - graylog-c4884382-571a-4490-90e5-01a125578189
172.16.114.239 172.16.114.239 64 80 1.09 d m elastic-search-60586867-1-183395330.major-qa.graylog.dfwqa2.qa.com
172.16.118.172 172.16.118.172 65 89 0.00 c - graylog-ae9e0ce7-07d7-44d5-ad65-2e01d842f314
172.16.115.216 172.16.115.216 82 59 1.15 d m elastic-search-60586867-2-183395333.major-qa.graylog.dfwqa2.qa.com
172.16.118.17 172.16.118.17 32 86 0.22 c - graylog-82a8d493-012a-4894-b749-3b2ac276b708
172.16.115.233 172.16.115.233 40 68 0.62 d m elastic-search-60586867-4-190299637.major-qa.graylog.dfwqa2.qa.com
172.16.113.146 172.16.113.146 49 67 1.13 d * elastic-search-60586867-5-190299640.major-qa.graylog.dfwqa2.qa.com
172.16.118.14 172.16.118.14 25 86 0.17 c - graylog-e400e5f3-9f72-41c8-8ef3-b57ba5270693

And here is the cluster health status I see:
{
"cluster_name" : "elasticsearch-major-qa",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 9,
"number_of_data_nodes" : 5,
"active_primary_shards" : 14718,
"active_shards" : 29436,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 8220,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 78.17080943275973
}


(David Pilato) #2

So, 37,656 shards on 5 nodes?
Around 7500 shards per node!
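
These figures can be reproduced from the cluster health output in the first post. A minimal sketch in shell, with the numbers copied from that output (the variable names are illustrative only):

```shell
# Values copied from the cluster health output in post #1.
active_shards=29436
unassigned_shards=8220
data_nodes=5

# Total shards the cluster is trying to hold (primaries plus replicas).
total_shards=$((active_shards + unassigned_shards))

# Integer division is enough for a rough per-node figure.
shards_per_node=$((total_shards / data_nodes))

echo "total=$total_shards per_node=$shards_per_node"
# prints "total=37656 per_node=7531"
```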

That's like running 7500 MySQL instances on 1 machine. Would you really do that?

You must reduce that number or increase the number of nodes.

BTW, you didn't mention the version or the hardware details.


(peshu) #3

I am using Elasticsearch 2.4.1 on CentOS 6.7, with 130 GB of disk and 16 GB of memory on each node.

Yes, I have 37K shards on 5 nodes. We are using OneOps to provision the environment and configuration.

Can you please point me to how to reduce them? I am OK with losing some or all of the data. I haven't configured so many shards anywhere, and I want to keep the number to a minimum. I am using Filebeat to ship the logs to Elasticsearch regularly.


(David Pilato) #4

Change the number of shards to 1 per index, use a shorter retention period, use the rollover API instead... Many choices. Maybe a combination of all of them?
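
As one possible illustration of the first option, an index template can set the shard count for indices created in the future. This is only a sketch: the template name `single-shard-logs` and the `graylog_*` pattern are assumptions, a template affects only indices created after it is installed, and if Graylog itself creates these indices, its own configuration may decide the shard count instead.

```shell
# Hypothetical template name and index pattern; adjust both to the real setup.
template_name="single-shard-logs"
index_pattern="graylog_*"

# Elasticsearch 2.x templates use the "template" key for the index pattern.
template_body=$(cat <<EOF
{
  "template": "${index_pattern}",
  "settings": {
    "number_of_shards": 1
  }
}
EOF
)

# Print the request instead of sending it, so it can be reviewed first.
echo "curl -XPUT 'http://localhost:9200/_template/${template_name}' -d '${template_body}'"
```

Existing indices keep their current shard count, so this helps only once the old indices have been removed or have aged out.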

So you have only 8 GB of heap, right? That might not be enough for so many shards.

BTW, about your specific problem: do you see any errors in the logs?


(peshu) #5

I don't see any errors in the logs. When I restart the complete cluster, it starts allocating all the 32K shards and stops allocating at some point without any errors.

When I tried to delete a few indexes using the curl commands below, it did not help.
$ curl -XDELETE 'http://localhost:9200/graylog_57529/'
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [graylog_57529]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (delete-index [graylog_57529]) within 30s"},"status":503}

$ curl -XDELETE 'localhost:9200/graylog_57286?pretty'
{
"acknowledged" : false
}
[app@elastic-search-60586867-5-190299640 ~]$ curl -XDELETE 'localhost:9200/graylog_57287?pretty'
{
"error" : {
"root_cause" : [ {
"type" : "process_cluster_event_timeout_exception",
"reason" : "failed to process cluster event (delete-index [graylog_57287]) within 30s"
} ],
"type" : "process_cluster_event_timeout_exception",
"reason" : "failed to process cluster event (delete-index [graylog_57287]) within 30s"
},
"status" : 503
}
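
The "within 30s" in these errors matches the default time a request waits for the master node to process a cluster-state change. As far as I know, the delete index API accepts a `master_timeout` query parameter to extend that wait. A hedged sketch follows; the helper name `delete_index_url`, the host, and the `5m` value are all illustrative assumptions:

```shell
# Build a delete-index URL with a longer master timeout.
# The host and the default of 5m are illustrative assumptions.
delete_index_url() {
  index="$1"
  master_timeout="${2:-5m}"
  printf 'http://localhost:9200/%s?master_timeout=%s\n' "$index" "$master_timeout"
}

# Print the command for review; run it manually once it looks right.
echo "curl -XDELETE '$(delete_index_url graylog_57529)'"
```

A longer timeout does not make the master any faster; it only gives the queued delete more time to be processed before the request gives up.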

And if I delete indexes manually from the directory and restart the cluster, the deleted ones are restored, I guess from a replica. Can you please suggest how to delete the indexes using curl or manually, so that I can bring up the cluster and then work on ES management?


(David Pilato) #6

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Can you try this: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/cluster-allocation-explain.html
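
Note that the linked page belongs to the 5.2 reference; to my knowledge the allocation explain API was introduced in 5.0, so it may not exist on a 2.4.1 cluster. A rough alternative is to summarize the `unassigned.reason` column of `_cat/shards`. The sketch below assumes that column is available in the running version and that the cluster listens on `localhost:9200`; the function name `count_unassigned_by_reason` is made up for illustration.

```shell
# Count unassigned shards per reason from `_cat/shards` output on stdin.
# Expected columns: index shard prirep state unassigned.reason
count_unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { counts[$5]++ }
       END { for (r in counts) print counts[r], r }' | sort -rn
}

# Example usage against a live cluster (assumed to listen on localhost:9200):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | count_unassigned_by_reason
```

The output is one line per reason, most frequent first, which can hint at whether the shards were left unassigned by the restart or by something else.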

