Cluster breaks very frequently

I have a 3-node cluster where all nodes are master-eligible. For the past month I have been facing a weird issue where the cluster breaks very frequently. I've tried a lot of things to make it stable, but nothing has worked. I am using ES 5.6.16, and in my case the GC runs very frequently.

```
[2022-11-01T10:54:52,911][WARN ][o.e.m.j.JvmGcMonitorService] [node-2] [gc][780755] overhead, spent [1.4m] collecting in the last [1.4m]
```

I've also tried tweaking cluster settings and running a force merge, but nothing has worked for me.
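For reference, the force merge calls were roughly along these lines (the index name is a placeholder and the default host/port is assumed):

```
# Roughly what I ran, per index, during a low-traffic window (index name is a placeholder)
curl -XPOST 'http://localhost:9200/my-index/_forcemerge?max_num_segments=1'
```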

When the cluster breaks, the nodes log errors like this:

```
Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:598) ~[elasticsearch-5.6.16.jar:5.6.16]
	... 14 more
```

You are using a very old version that has been EOL for a long time, so I recommend you upgrade ASAP.

What is the full output of the cluster stats API?
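That is, something like this (assuming the default host and port):

```
curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'
```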

@Christian_Dahlqvist, thanks for the reply. Sure, we are in a transition phase, but we need to support the legacy system for some time as well. Please find the cluster statistics below.

```
{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "elastic-search",
  "timestamp": 1667364285490,
  "status": "green",
  "indices": {
    "count": 40,
    "shards": {
      "total": 902,
      "primaries": 304,
      "replication": 1.9671052631578947,
      "index": {
        "shards": {
          "min": 10,
          "max": 27,
          "avg": 22.55
        },
        "primaries": {
          "min": 5,
          "max": 9,
          "avg": 7.6
        },
        "replication": {
          "min": 1.0,
          "max": 2.0,
          "avg": 1.95
        }
      }
    },
    "docs": {
      "count": 10457899,
      "deleted": 893903
    },
    "store": {
      "size_in_bytes": 1197975495169,
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 20345744,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 48421530,
      "total_count": 1531139351,
      "hit_count": 104916005,
      "miss_count": 1426223346,
      "cache_size": 50064,
      "cache_count": 6584989,
      "evictions": 6534925
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 8784,
      "memory_in_bytes": 6106210040,
      "terms_memory_in_bytes": 5848381731,
      "stored_fields_memory_in_bytes": 38513272,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 115063936,
      "points_memory_in_bytes": 4162781,
      "doc_values_memory_in_bytes": 100088320,
      "index_writer_memory_in_bytes": 168254915,
      "version_map_memory_in_bytes": 3228,
      "fixed_bit_set_memory_in_bytes": 891680,
      "max_unsafe_auto_id_timestamp": 1667300841619,
      "file_sizes": {

      }
    }
  },
  "nodes": {
    "count": {
      "total": 3,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 3
    },
    "versions": [
      "5.6.16"
    ],
    "os": {
      "available_processors": 96,
      "allocated_processors": 96,
      "names": [
        {
          "name": "Linux",
          "count": 3
        }
      ],
      "mem": {
        "total_in_bytes": 200153673728,
        "free_in_bytes": 1518215168,
        "used_in_bytes": 198635458560,
        "free_percent": 1,
        "used_percent": 99
      }
    },
    "process": {
      "cpu": {
        "percent": 9
      },
      "open_file_descriptors": {
        "min": 1392,
        "max": 1416,
        "avg": 1405
      }
    },
    "jvm": {
      "max_uptime_in_millis": 63879758,
      "versions": [
        {
          "version": "1.8.0_201",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.201-b08",
          "vm_vendor": "Oracle Corporation",
          "count": 3
        }
      ],
      "mem": {
        "heap_used_in_bytes": 61050643808,
        "heap_max_in_bytes": 96034947072
      },
      "threads": 845
    },
    "fs": {
      "total_in_bytes": 2497941516288,
      "free_in_bytes": 1249017913344,
      "available_in_bytes": 1248967581696,
      "spins": "true"
    },
    "plugins": [
      {
        "name": "analysis-kuromoji",
        "version": "5.6.16",
        "description": "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
        "has_native_controller": false
      },
      {
        "name": "elasticsearch-analysis-openkoreantext",
        "version": "1.0.0",
        "description": "Korean analysis plugin integrates open-korean-text module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.openkoreantext.AnalysisOpenKoreanTextPlugin",
        "has_native_controller": false
      },
      {
        "name": "analysis-smartcn",
        "version": "5.6.16",
        "description": "Smart Chinese Analysis plugin integrates Lucene Smart Chinese analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.smartcn.AnalysisSmartChinesePlugin",
        "has_native_controller": false
      },
      {
        "name": "analysis-stempel",
        "version": "5.6.16",
        "description": "The Stempel (Polish) Analysis plugin integrates Lucene stempel (polish) analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.stempel.AnalysisStempelPlugin",
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "netty4": 3
      },
      "http_types": {
        "netty4": 3
      }
    }
  }
}
```

As you are using a very old version, some information that is available in newer versions is missing from the stats, and I have not used this version in years. Overall I think the stats look OK and nothing much jumps out, although it looks like you may have quite large documents in your indices. As you apparently are suffering from heap pressure and long, frequent GC, it is possible that this is a contributing factor. Since we do not know anything about your use case, data or queries, it is hard to tell.
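If you want to watch how heap usage and GC develop per node while you investigate, something along these lines can help (assuming the default host and port; available columns may differ slightly between versions):

```
# Quick per-node view of heap usage and segment memory
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,segments.memory'

# More detail, including GC collection counts and times per node
curl -XGET 'http://localhost:9200/_nodes/stats/jvm?human&pretty'
```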

Thanks, @Christian_Dahlqvist, I really appreciate your help. Yes, we have quite large documents. If I configure 3 dedicated master nodes, will that prevent the cluster from breaking?

Impossible to tell as I do not know exactly what the issue is. It may help with stability, but if it is the data nodes that are suffering from GC issues it may not change things much.
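If you do try dedicated masters on 5.6, the node roles are set in elasticsearch.yml, roughly like this (a sketch of the role settings only, not a sizing recommendation):

```
# elasticsearch.yml on a dedicated master-eligible node (ES 5.x role settings)
node.master: true
node.data: false
node.ingest: false
```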

OK, is it recommended to increase the Elasticsearch search queue size? I am seeing rejections like this:

```
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@503a4548 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1ad7ca35[Running, pool size = 49, active threads = 49, queued tasks = 4149, completed tasks = 42552493]]
```

And what about increasing the ping timeout for the transport layer?

Increasing the queue size would likely just keep more requests in memory and put additional pressure on the heap and GC. Increasing the ping timeout does not address the underlying problem. You need to look at your queries and data to see what is driving heap usage.
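You can also keep an eye on how often searches are being rejected per node, for example:

```
# Search thread pool: active threads, queue depth and rejection counts per node
curl -XGET 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
```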

One thing you could try is to reduce the number of replicas from 2 to 1, as that would reduce the amount of data on each node. This could potentially be an easy way to reduce heap usage.
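The replica count can be changed dynamically, for example like this (using _all here; you may want to target specific indices instead):

```
# Reduce replicas from 2 to 1 across all indices
curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index": {
    "number_of_replicas": 1
  }
}'
```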

I would recommend you upgrade to the latest version of Elasticsearch, as a lot of improvements have been made since the version you are using came out. There are also not many people with recent experience of your version who would be able to help troubleshoot something like this.

Please upgrade. 5.x is super old and very much EOL, and there have been a tonne of improvements around cluster resilience and stability that will help your situation.

Thanks a lot, @Christian_Dahlqvist for your help and suggestions.
