Hello, when I have an Elasticsearch cluster with nothing but the .kibana and .marvel indices in it, my consumer will periodically encounter exceptions and shut down because ES closes the connection.
Packrat is the consumer that reads documents off RabbitMQ and writes them into Elasticsearch. When running under supervision, it restarts and grinds through this process until all the required indices have been created, and then everything runs well. I've found some other issues that seem related, but I'd like to note that I'm only creating one index per month per year, and I'm only allocating 5 shards per index for dates before 2000 and 25 for dates after 2000. I've attached the seemingly related issues below:
The ES cluster is 16 nodes: 1 dedicated master, 2 dedicated client nodes, and the rest are data nodes. The servers are all 8-core Intel Xeon E5 CPUs with 64 GB of RAM.
To answer your questions:
[2017-10-24 17:15:57,020][ERROR][marvel.agent ] [this.is.the.server.name] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
: index [.marvel-es-1-2017.10.24], type [node_stats], id [AV9QPQR2OX3ZaVlWAb_2], message [UnavailableShardsException[[.marvel-es-1-2017.10.24] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.marvel-es-1-2017.10.24] containing  requests]]]];
... 3 more
Caused by: ElasticsearchException[failure in bulk execution:
: index [.marvel-es-1-2017.10.24], type [node_stats], id [AV9QPQR2OX3ZaVlWAb_2], message [UnavailableShardsException[[.marvel-es-1-2017.10.24] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.marvel-es-1-2017.10.24] containing  requests]]]]
... 3 more
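Until the indices exist, one way to reduce the restart churn is to retry failed writes with backoff inside the consumer rather than letting it die and be restarted by the supervisor. A minimal sketch; the callable and the exception type are placeholders, not Packrat's actual API:

```python
import time

def with_retries(fn, attempts=5, base_delay=1.0, retriable=(ConnectionError,)):
    """Call fn(), retrying with exponential backoff on retriable errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of retries: fall through to the supervisor restart
            time.sleep(base_delay * (2 ** attempt))
```

In practice you would catch the Elasticsearch client's connection/transport exceptions rather than the built-in ConnectionError.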
Also, there are so many more shards for dates after 2000 because that's where the lion's share of the data is. Indices are created according to the date the document was generated; an example index name would be sf_ab_doc__1998_03__v2.1. Right now, in a test ES cluster, we have about 606 indices with a few million documents and about 6,500 shards.
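For reference, the naming and shard-allocation scheme described above can be sketched as follows (the function name is illustrative, and treating the year 2000 itself as "after" is an assumption):

```python
import datetime

def index_for(doc_date, prefix="sf_ab_doc", version="v2.1"):
    """Return (index_name, shard_count) for the month a document was generated in."""
    name = f"{prefix}__{doc_date.year}_{doc_date.month:02d}__{version}"
    shards = 5 if doc_date.year < 2000 else 25  # assumption: 2000 counts as "after"
    return name, shards
```

For example, `index_for(datetime.date(1998, 3, 1))` yields `("sf_ab_doc__1998_03__v2.1", 5)`.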
The indices are stored in the cluster state. The shards are stored in a routing table, which is also part of the cluster state. Every time a shard changes state (created, started, recovered, initialised, moved), the routing table, and thus the cluster state, must be updated. In 2.x the entire cluster state was updated and then sent to all other nodes in the cluster, whereas in 5.x we send just the changes that were made, which is far more efficient.
Cluster state updates are generally done on a single thread in order to ensure consistency; the changes are then propagated to all the nodes in the cluster. In addition to adding or altering indices and changing the location and state of shards, changes to mappings also require the cluster state to be updated. If you are therefore using dynamic mappings and have frequent changes or additions, this will result in an even larger number of cluster state updates.
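If dynamic mappings are contributing, one option is to lock mappings down so that documents with new field names no longer trigger cluster state updates. A sketch of the mapping body, using the ES 2.x `_default_` mapping syntax; how you apply it (index template or create-index call) is up to you:

```python
# Mapping body that rejects documents containing unknown fields instead of
# dynamically adding them, so new field names no longer change the cluster state.
# ("dynamic": "strict" rejects; "dynamic": False would silently ignore new fields.)
strict_mappings = {
    "mappings": {
        "_default_": {           # applies to all types in the index (ES 2.x)
            "dynamic": "strict",
        }
    }
}
```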