We had to do a hard shutdown of all of our Elasticsearch servers due to an environmental issue. Now all of our shards are unassigned:
{
"cluster_name" : "Elasticsearch-Cluster-1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 101,
"active_shards" : 101,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 109,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 48.095238095238095
}
logstash-2016.04.27 1 r UNASSIGNED
logstash-2016.04.27 0 r UNASSIGNED
logstash-2016.04.28 2 p UNASSIGNED
logstash-2016.04.28 2 r UNASSIGNED
logstash-2016.04.28 1 p UNASSIGNED
logstash-2016.04.28 1 r UNASSIGNED
logstash-2016.04.28 0 p UNASSIGNED
logstash-2016.04.28 0 r UNASSIGNED
.marvel-es-2016.04.28 0 p UNASSIGNED
.marvel-es-2016.04.28 0 r UNASSIGNED
.marvel-es-data-1 0 r UNASSIGNED
.marvel-es-2016.04.25 0 r UNASSIGNED
.marvel-es-2016.04.24 0 r UNASSIGNED
.marvel-es-2016.04.27 0 r UNASSIGNED
.marvel-es-2016.04.26 0 r UNASSIGNED
logstash-2016.03.29 2 r UNASSIGNED
logstash-2016.03.29 1 r UNASSIGNED
logstash-2016.03.29 0 r UNASSIGNED
logstash-2016.03.28 2 r UNASSIGNED
logstash-2016.03.28 1 r UNASSIGNED
logstash-2016.03.28 0 r UNASSIGNED
.marvel-es-1-2016.04.28 0 r UNASSIGNED
.marvel-es-2016.03.30 0 r UNASSIGNED
.marvel-es-2016.03.31 0 r UNASSIGNED
.marvel-es-2016.04.01 0 r UNASSIGNED
.marvel-es-2016.03.28 0 r UNASSIGNED
.marvel-es-2016.03.29 0 r UNASSIGNED
logstash-2016.04.01 2 r UNASSIGNED
logstash-2016.04.01 1 r UNASSIGNED
logstash-2016.04.01 0 r UNASSIGNED
logstash-2016.04.02 2 r UNASSIGNED
logstash-2016.04.02 1 r UNASSIGNED
logstash-2016.04.02 0 r UNASSIGNED
logstash-2016.03.31 2 r UNASSIGNED
logstash-2016.03.31 1 r UNASSIGNED
logstash-2016.03.31 0 r UNASSIGNED
logstash-2016.04.03 2 r UNASSIGNED
logstash-2016.04.03 1 r UNASSIGNED
logstash-2016.04.03 0 r UNASSIGNED
logstash-2016.03.30 2 r UNASSIGNED
logstash-2016.03.30 1 r UNASSIGNED
logstash-2016.03.30 0 r UNASSIGNED
logstash-2016.04.04 2 r UNASSIGNED
logstash-2016.04.04 1 r UNASSIGNED
logstash-2016.04.04 0 r UNASSIGNED
logstash-2016.04.09 2 r UNASSIGNED
logstash-2016.04.09 1 r UNASSIGNED
logstash-2016.04.09 0 r UNASSIGNED
logstash-2016.04.05 2 r UNASSIGNED
logstash-2016.04.05 1 r UNASSIGNED
logstash-2016.04.05 0 r UNASSIGNED
logstash-2016.04.06 2 r UNASSIGNED
logstash-2016.04.06 1 r UNASSIGNED
logstash-2016.04.06 0 r UNASSIGNED
logstash-2016.04.07 2 r UNASSIGNED
logstash-2016.04.07 1 r UNASSIGNED
logstash-2016.04.07 0 r UNASSIGNED
logstash-2016.04.08 2 r UNASSIGNED
logstash-2016.04.08 1 r UNASSIGNED
logstash-2016.04.08 0 r UNASSIGNED
.marvel-es-2016.04.10 0 r UNASSIGNED
.marvel-es-2016.04.12 0 r UNASSIGNED
.marvel-es-2016.04.11 0 r UNASSIGNED
.marvel-es-2016.04.07 0 r UNASSIGNED
.marvel-es-2016.04.06 0 r UNASSIGNED
.marvel-es-2016.04.09 0 r UNASSIGNED
.marvel-es-2016.04.08 0 r UNASSIGNED
.kibana 0 r UNASSIGNED
.marvel-es-2016.04.03 0 r UNASSIGNED
.marvel-es-2016.04.02 0 r UNASSIGNED
.marvel-es-2016.04.05 0 r UNASSIGNED
.marvel-es-2016.04.04 0 r UNASSIGNED
logstash-2016.04.12 2 r UNASSIGNED
logstash-2016.04.12 1 r UNASSIGNED
logstash-2016.04.12 0 r UNASSIGNED
logstash-2016.04.13 2 r UNASSIGNED
logstash-2016.04.13 1 r UNASSIGNED
logstash-2016.04.13 0 r UNASSIGNED
logstash-2016.04.14 2 r UNASSIGNED
logstash-2016.04.14 1 r UNASSIGNED
logstash-2016.04.14 0 r UNASSIGNED
Anyone have any suggestions on how to best bring them back into the cluster?
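For reference, the output above was gathered with something like the cluster health and cat shards APIs (localhost:9200 stands in for one of our nodes):
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state' | grep UNASSIGNED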
Hi! Look at the data node and active master logs. Are there any errors or warnings about shards?
Another point: look at the filesystem level on the nodes. Are the shard files actually there, or are the shard directories just empty?
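A minimal sketch of what I mean, assuming the default packaged paths and the 2.x on-disk layout (adjust path.data, the log path, and the cluster name to your install; the index name is just one of the red ones from your listing):
# look for allocation-related errors/warnings in a node's log
tail -n 200 /var/log/elasticsearch/Elasticsearch-Cluster-1.log | grep -iE 'warn|error'
# check whether shard data actually exists on disk for one of the affected indices
ls -l /var/lib/elasticsearch/Elasticsearch-Cluster-1/nodes/0/indices/logstash-2016.04.28/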
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[p-es-1][192.168.56.86:9300][cluster:admin/reroute]"}],"type":"illegal_argument_exception","reason":"[allocate] allocation of [logstash-2016.04.20][2] on node {p-es-1}{uUw7qMvGSCqyILY1IbT9NQ}{192.168.56.86}{192.168.56.86:9300} is not allowed, reason: [YES(target node version [2.3.2] is same or newer than source node version [2.3.2])][NO(shard cannot be allocated on same node [uUw7qMvGSCqyILY1IbT9NQ] it already exists on)][YES(shard not primary or relocation disabled)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][NO(more than allowed [85.0%] used disk on node, free: [14.640832792774445%])][YES(below shard recovery limit of [2])][YES(primary is already active)][YES(no allocation awareness enabled)][YES(node passes include/exclude/require filters)][YES(allocation disabling is ignored)][YES(allocation disabling is ignored)]"},"status":400}
The line that sticks out is: "shard cannot be allocated on same node [uUw7qMvGSCqyILY1IbT9NQ] it already exists on". I tried allocating it to each of our hosts and ended up with the same error.
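For context, that error came from a manual reroute; on 2.x the allocate command looks roughly like this (index, shard, and node values taken from the error above, 192.168.56.86 being one of our nodes):
curl -XPOST 'http://192.168.56.86:9200/_cluster/reroute' -d '
{
  "commands": [
    { "allocate": { "index": "logstash-2016.04.20", "shard": 2, "node": "p-es-1" } }
  ]
}'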
What is the output of df -h? As you can see, the message reports percentages rather than absolute numbers, so no matter how much space you have compared to the shard size, the node has less than 15% free disk space.
In fact, if you know your data rate, you can disable the check completely with cluster.routing.allocation.disk.threshold_enabled,
or change cluster.routing.allocation.disk.watermark.low and cluster.routing.allocation.disk.watermark.high to higher percentage values or to absolute byte values (like 900mb).
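A sketch with the cluster settings API; the 90%/95% values are just placeholders, pick ones that fit your disks:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'
# or, to turn the disk check off entirely:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'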
I actually didn't make any changes to the cluster configuration. I didn't realize that shard allocation stops completely once free disk space drops below 15%. In our case we still had several hundred GB free, but it was below 15%. So now I just monitor disk space a little more closely and make sure indexes are being deleted. Once I rebuild the cluster in production, I may set an absolute value based on our disk space.
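In case it helps someone else, this is roughly what I use now to keep an eye on per-node disk usage and drop old daily indices (the index name is just an example from the listing above):
# per-node disk use and shard counts
curl -XGET 'http://localhost:9200/_cat/allocation?v'
# delete an old daily index once it's no longer needed
curl -XDELETE 'http://localhost:9200/logstash-2016.03.28'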
In theory, as long as your data is on a different filesystem, it probably won't make a difference. I think the setting is only in place to protect you from filling up your root filesystem and halting your system.