Pending tasks queue

Hi Team,

Below is my cluster health. I am not able to ingest any logs into ES: as soon as the Logstash consumer starts using the Elasticsearch bulk API, documents stop getting indexed and my pending tasks queue piles up. I am not able to make my cluster stable.

```
curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "galaxy",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 33342,
  "active_shards" : 53879,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 12801,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3334,
  "number_of_in_flight_fetch" : 0
}
```

Pending tasks on my master:

{"mon":{"what":"pending_tasks","queue_size":3490,"HIGH":3442,"NORMAL":44,"URGENT":4,"master":{"pri":"NORMAL","cnt":44,"avgT":3.30664e+06,"minT":2345897,"minInOrder":6060},"shard-started":{"pri":"URGENT","cnt":4,"avgT":13820.5,"minT":13687,"minInOrder":10473},"_add_listener_":{"pri":"HIGH","cnt":9,"avgT":3.7101e+06,"minT":2583947,"minInOrder":6077},"refresh-mapping":{"pri":"HIGH","cnt":3432,"avgT":1.89575e+06,"minT":174,"minInOrder":10001},"cluster_reroute":{"pri":"HIGH","cnt":1,"avgT":3859580,"minT":3859580,"minInOrder":6035}}}

Can you please help me get my cluster stable?

What version of Elasticsearch is this? 1.x? If so the answer is simple: upgrade to 2.x. You have a ton of shards in the cluster. In 2.0 Elasticsearch learned how to send cluster state updates out by diffing the new and old cluster state. Before that it sent the whole cluster state out after every action.

Reducing the number of shards will certainly help. Things like creating new indexes with fewer shards and deleting old indexes are the way to go for this.
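Deleting an old daily index is a one-liner; the index name below is just a made-up example, swap in your own:

```
# Drop an index you no longer need; its shards disappear from the cluster.
curl -XDELETE 'localhost:9200/logstash-2016.09.01'
```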

Nik's being nice :wink:
You have a ridiculously massive amount of shards, and that is the cause of the problem. Upgrade as he mentions, reduce your shard count too.

Thanks @warkolm @nik9000. I have been doing cleanup for the past 2 days and the old indices are almost cleared.

I am using:

```
"version" : {
  "number" : "1.7.2"
}
```

My current cluster state is:

```
{
  "cluster_name" : "galaxy",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 50,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 9352,
  "active_shards" : 17151,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1553,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 1,
  "number_of_in_flight_fetch" : 0
}
```

Are the shard counts fine now? I will also upgrade to 2.x, but this will mean bumping my Logstash version too.

For new indices I will reduce the primary shards to 3 and replicas to 1. Can you please guide me on how many shards should be on each data node?

e.g. I have 6 data nodes of type r3.2xlarge on AWS. How many shards can a particular node accommodate?
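This is roughly the template I plan to put in place for the new indices; the template name and index pattern below are just what I intend to use, not something already applied:

```
# Every newly created index matching the pattern gets 3 primaries + 1 replica.
curl -XPUT 'localhost:9200/_template/logs_3x1' -d '{
  "template" : "*-logs-*",
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'
```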

Tip: wrap code-like stuff in ``` so it is easier to read.

In 2.x we don't really know "how many shard copies is too many". In 1.7 I'd say about 2000 is pushing it, especially given the number of nodes you need to effectively run that many shard copies. We also don't have a hard and fast rule for the number of shards a node can run. It depends on the size and what features you enable. And what version as well. Newer versions are almost always better on all combinations of features though.
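If you want to eyeball how the shards are spread across your data nodes, the cat API is handy:

```
# Shard count and disk usage per node, with column headers.
curl 'localhost:9200/_cat/allocation?v'
```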

Good work slashing the size down! 2.x probably would have been much more stable, but, yeah, that means a major upgrade.

Thanks @nik9000 for being nice :wink: and thanks @warkolm too. I have reduced the shard count now:

  "cluster_name" : "galaxy",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 56,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 2254,
  "active_shards" : 4219,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 289,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}```

The cluster seems to be stable now. I will also go for the upgrade even though it is a pain, as broadcasting the whole cluster state with this many shards is too costly. Thanks all for your help.

Great stuff!

Thanks @warkolm. Now I am facing another issue with my cluster. The ingestion rate on my cluster looks good, but it is only half of the publish rate into my RabbitMQ, so my queue size is around 80M. I can also see that my replicas have been in an unassigned state for a long time and are not initializing, so when Kibana queries those shards I get a SearchPhaseExecutionException.
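Would it make sense to kick the allocator with an empty reroute, something like this?

```
# With no commands in the body this just asks the master to run
# another allocation pass over the unassigned shards.
curl -XPOST 'localhost:9200/_cluster/reroute?pretty'
```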

Aside from this, logs are not getting indexed properly. Can you guys please help me figure out whether this is due to dynamic mapping causing issues with indexing? Dynamic mapping is what we use as of now; a rough sketch of the mapping template I have in mind is at the end of this post.

```
xxxx-nginx-logs-2016.10.04       1 r UNASSIGNED
xxxx-nginx-logs-2016.10.04       4 r UNASSIGNED
xxxx-nginx-logs-2016.10.03       3 r UNASSIGNED
xxxx-nginx-error-logs-2016.10.04 2 r UNASSIGNED
```

  "cluster_name" : "galaxy",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 34,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 2752,
  "active_shards" : 5207,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 297,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}```
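Here is the rough sketch of the template I have in mind for pinning the mappings down; the fields below are only examples from my nginx logs, not the full mapping:

```
# Predeclaring the fields means a new document no longer triggers a
# cluster-wide refresh-mapping task for every unseen field.
curl -XPUT 'localhost:9200/_template/nginx_logs_mapping' -d '{
  "template" : "xxxx-nginx-logs-*",
  "mappings" : {
    "_default_" : {
      "properties" : {
        "status" :  { "type" : "integer" },
        "request" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'
```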