Elasticsearch cluster instability

Sorry for the lack of information here, but we have been experiencing massive cluster instability for the last week or so. We upgraded to 1.5.2 last week and were stable for a full week before we started to have issues. It looked like the master node would lose connectivity with a data node, as we were seeing timeout exceptions in our logs. Restarting the node wouldn't fix it; even when we powered the node off, the cluster still thought it was joined. The only way to recover was to completely stop every node and restart the whole cluster. The cluster would rebuild fine, and then this would happen again. Whenever it did, EVERY node would stop responding to REST API calls. Any API call beyond a plain 'localhost:9200' or 'localhost:9200/_cluster/health' would just hang and never respond (which also means no searching, plugins don't work, etc.).

Cluster stats:

Nodes: 11
Shards: 15861
Size: 2 TB
Master: 4 nodes
Client: 4 nodes
Data: 3 nodes

Things I've tried:

  1. Moving all nodes to the same network
  2. Reducing the number of shards in the cluster
  3. Downgrading to 1.5.0 (When I tried starting a 1.5.0 node, I got a Java stacktrace: CorruptStateException: State version mismatch expected: )

This is extremely frustrating, and has been impacting production for days now!

15,861 shards on 11 nodes looks like an awful lot. You mentioned reducing the number of shards in the cluster; by how much did you reduce it (I'm not sure the shard count can be altered after an index is created)? And when you saw the timeouts, did you check the relevant timeout settings and adjust them accordingly?
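
(For reference: the replica count can be changed on a live index, but the primary shard count is fixed when the index is created. A minimal sketch of dropping replicas with the update-settings API, assuming the default logstash-* index naming and a node reachable on localhost:9200:)

# Lower replicas from 2 to 1 on all matching indices (primaries cannot be changed in place)
curl -XPUT 'localhost:9200/logstash-*/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'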

Agreed with @Jason_Wee. The number of shards looks pretty high... What is your master node configuration?

Thanks all. Right now it is 15k shards for 3 data nodes; 5 primary shards per index with two replicas. The following is my master node config:

## Node Information
node.data: false
node.master: true
node.name: "<%= @hostname %>-node1"
node.rack: devqa-hp360-physical
path.data: /app/elasticsearch-node1
http.port: 9200
http.cors.allow-origin: "/.*/"
http.cors.enabled: true
cluster.name: <%= scope.lookupvar("elasticsearch::cluster_name") %>

discovery.zen.ping.unicast.hosts: <%= scope.lookupvar("elasticsearch::node_addresses") %>

network.bind_host: _eth2:ipv4_
network.publish_host: _eth2:ipv4_

bootstrap.mlockall: true
gateway.recover_after_nodes: 3
index.number_of_replicas: 2
index.merge.scheduler.max_thread_count: 1

## Discovery Information
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false

action.disable_delete_all_indices: true

script.groovy.sandbox.enabled: false
## Threadpool Settings ##

# Search pool
threadpool.search.type: cached
threadpool.search.size: 60
threadpool.search.queue_size: 600

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 80
threadpool.bulk.queue_size: 600

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 60
threadpool.index.queue_size: -1

# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

# Cache Sizes
indices.fielddata.cache.size: 75%
indices.fielddata.cache.expire: 6h
indices.cache.filter.size: 15%
indices.cache.filter.expire: 6h
indices.breaker.fielddata.limit: 85%

# Indexing Settings for Writes
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

indices.recovery.max_bytes_per_sec: 5000mb

By master node config, I meant the RAM/processors. Anyway, for your data: we hit a similar master bottleneck when we had 36k shards and the master was 2 cores / 5 GB. We moved to 4 cores / 28 GB and things ran smoothly for even more shards than we wanted to support.

~15,000 shards for 2 TB of data is extremely excessive. Is that just your primary shard count? Because with number_of_replicas set to 2, you've actually got around 45,000 shards. Keep in mind each shard is a Lucene index, which consumes resources on whichever node holds it...

I'd suggest you drastically drop your number of shards. For 2TB of data, around 40 (primary) shards across all indices is probably much better.
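
(A quick way to check what the cluster is actually carrying, assuming the _cat APIs respond; host and port are placeholders:)

# Total shard copies, primaries plus replicas
curl -s 'localhost:9200/_cat/shards' | wc -l

# Primaries only; the third _cat/shards column is "p" or "r"
curl -s 'localhost:9200/_cat/shards' | awk '$3 == "p"' | wc -l

# Per-index primary and replica counts
curl -s 'localhost:9200/_cat/indices?v'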

No, that is ~15,000 shards total; about 5,000 of those are primaries.

The reason we have so many shards is that we use Logstash for all of our indexing, and that creates a new index per day.

So right now I am noticing that the /_nodes API can sometimes take forever to return. This is causing issues with Kibana, because it looks like Kibana also makes a call to the /_nodes API. I suspect there is an issue with some thread pool somewhere, but for the life of me I can't figure out which one it is.
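
(One way to see where node CPU time is going while a call hangs, assuming the nodes answer at all, is the hot threads API; a minimal check:)

# Dump the busiest threads on every node; run this while a slow /_nodes call is in flight
curl -s 'localhost:9200/_nodes/hot_threads'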

You can change the number of shards instead of using the default 5 shards.
Also, you can create an index per month instead of per day if your volume of docs per day is not that big.
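
(A minimal sketch of the first suggestion as an index template, so that every future daily index picks it up; the template name and the shard/replica numbers are placeholders to tune for your volume, and switching to monthly indices would be a change on the Logstash side:)

curl -XPUT 'localhost:9200/_template/logstash_shards' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'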

I get a little confused about 'best practices' for shard distribution with Elasticsearch. The default number of primary shards is 5, so what are the pros and cons of changing that default? Also, is the high number of shards what's causing my /_nodes, /_search and /_msearch API calls to take so long to return? Right now basic searches are taking ~30 seconds to complete.

Optimally you want exactly 1 shard per node per index if you can manage it. So if you have 3 data nodes, you want only 1 primary shard with 2 replicas per index. This will give you optimal performance.
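
(For a manually created index, that would look roughly like the following; the index name is just an example:)

curl -XPUT 'localhost:9200/myindex' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  }
}'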

Sometimes having just 1 shard per node might not be ideal, especially on machines with more than 1 core. A lot of things in Lucene happen sequentially, especially how it processes segments. If a large data set is spread across more than one shard, one ES query can result in multiple sub-queries (Lucene queries) that are handled in parallel... By more than 1 shard I mean a small, single-digit number of shards :smile:

I still don't have an answer as to why /_nodes, /_cat, /_search, /_msearch, etc. (but not /_cluster) sometimes take FOREVER to return any information (30 to 90 seconds per query).

[root@esearchctrl1-ilg1 ~]# time curl -XGET 'http://10.241.149.93:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
esearchctrl1-ilg1.colo-bo.ilg1.vrsn.com 10.241.157.93 0 0 0 0 0 0 0 0 0
esearch2-ilg1.colo-bo.ilg1.vrsn.com 10.241.156.78 2 2 10 0 0 0 0 0 0
esearch3-ilg1.colo-bo.ilg1.vrsn.com 10.241.156.79 5 0 0 0 0 0 0 0 0
esearchctrl2-ilg1.colo-bo.ilg1.vrsn.com 10.241.157.94 0 0 0 0 0 0 0 0 0
esearchctrl1-ilg1.colo-bo.ilg1.vrsn.com 10.241.157.93 0 0 0 0 0 0 0 0 0
esearchctrl2-ilg1.colo-bo.ilg1.vrsn.com 10.241.157.94 0 0 0 0 0 0 0 0 0
esearch1-ilg1.colo-bo.ilg1.vrsn.com 10.241.156.77 3 0 0 0 0 0 0 0 0

real 0m50.050s
user 0m0.001s
sys 0m0.004s

50 seconds to return that info, and none of the thread pools are going out of control.
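
(If the thread pools look healthy, another thing worth checking, on the assumption that the master is simply backed up processing cluster-state updates for ~15k shards, is the pending cluster tasks queue:)

# Tasks queued on the master; a long or old queue points at an overloaded master
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'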