Distributed single points of failure

We have a few high traffic ES clusters running in a production environment.
Our setup varies slightly from location to location but is generally as follows:

3 master nodes w/ 4 GB ram
4 client nodes behind a round robin load balancer with 64GB ram
18 data nodes w/ 8 x SSDs with 256GB ram

We are running Elasticsearch 2.4.1. All nodes have 30.5GB heap.

In the last couple of months we've had a few issues related to a yet-to-be identified surge in heap usage, on one node, triggering it to become unresponsive.

Our shard distribution is such that most queries end up hitting every data node. Which means that whenever a node is unresponsive, our cluster becomes unresponsive as well. So what we have our our hands is a nice setup with a distributed single point of failures cluster.

This begs the question: With a replica factor of 2 for all indices, would it not be possible to be smarter about this -- have ES query all three shards at the same time and use whichever one is fastest ?

During normal operation, our used heap hovers around 20GB so the committed 30.5GB should suffice. Yes, we need to figure out the issue on our end but it would be nice if the cluster could be better at safeguarding us from our own misconfigurations.

Do you have Marvel (Monitoring) installed to see why this surge may be happening?

No marvel in production. We're using it in our test environment but haven't gotten the go-ahead for paying the $$$$ needed for using it in production. Note that my main motivation for making the post is the distributed single point of failure part.

As our cluster grows we're sort of always expecting that the heap usage will sneak up on us. It would be nice if this wouldn't cause such catastrophic impact. We're actively monitoring GC logs to catch early signs of heap pressure, but the sudden surges aren't caught by this.

Marvel is free :slight_smile:

Was not aware of that :slight_smile: Thought it was $500 per server per year or something.
Regardless, care to comment on the main underlying comment about the distributed single points of failure ?

We aim to achieve HA, but regardless of the stability of our own setup or Elasticsearch itself, a situation where one node becomes slow or temporary unreachable is inevitable.

It used to be, but we made it free!

One single node having issues shouldn't cause issues, can you share the config you have set?

Here's the running process

/usr/lib/jvm/oracle-jdk8-x86_64/bin/java -Xms30720m -Xmx30720m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -Xloggc:/var/log/elasticsearch/gc.log -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.4.1.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -d -p /var/run/elasticsearch/elasticsearch.pid --default.path.home=/usr/share/elasticsearch --default.path.logs=/data/elasticsearch/log --default.path.data=/data/elasticsearch --default.path.conf=/etc/elasticsearch

Here's /etc/default/elasticsearch

ES_JAVA_OPTS="-XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -Xloggc:/var/log/elasticsearch/gc.log"

and here's elasticsearch.yml

cluster.name: productionv2
path.data: /data/elasticsearch
node.name: esdata1
node.master: false
node.type: primary
discovery.zen.ping.multicast.enabled: false
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["esmaster1", "esmaster2", "esmaster3"]
gateway_expected_nodes: 3
gateway.recover_after_nodes: 2
gateway.local.auto_import_dangled: no
gateway.recover_after_time: 3m
bootstrap.mlockall: true
action.auto_create_index: false
action.destructive_requires_name: true
script.inline: true
script.indexed: true

http.enabled: true
http.port: 9200
http.publish_port: 9200
http.cors.allow-origin: "*"
http.cors.enabled: true
cluster.routing.allocation.cluster_concurrent_rebalance: 16
cluster.routing.allocation.node_concurrent_recoveries: 16
index.number_of_replicas: 2
index.query.bool.max_clause_count: 10240
  - esdata1
  - localhost
indices.recovery.concurrent_streams: 16
indices.store.throttle.max_bytes_per_sec: 200mb
indices.recovery.throttle.max_bytes_per_sec: 200mb
index.max_result_window: 300000

Note that we have since bumped heap to 128g while we're figuring out what's going on.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.