Cluster breaks very frequently

I have a 3-node cluster where all nodes are master-eligible. For the past month I have been facing a weird issue where the cluster breaks very frequently. I've tried a lot of things to make it stable, but nothing has worked. I am using ES 5.6.16, and in my case the GC runs very frequently.

```
[2022-11-01T10:54:52,911][WARN ][o.e.m.j.JvmGcMonitorService] [node-2] [gc][780755] overhead, spent [1.4m] collecting in the last [1.4m]
```

I've also tried tweaking cluster settings and running a force merge, but nothing has worked for me.
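For reference, the force merge calls were roughly along these lines (the index name is a placeholder and the default host/port is assumed):

```
# Roughly what I ran, per index, during a low-traffic window (index name is a placeholder)
curl -XPOST 'http://localhost:9200/my-index/_forcemerge?max_num_segments=1'
```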

When the cluster breaks, the nodes log errors like this:

```
Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:598) ~[elasticsearch-5.6.16.jar:5.6.16]
	... 14 more
```

You are using a very old version that has been EOL for a long time, so I recommend you upgrade ASAP.

What is the full output of the cluster stats API?
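That is, something like this (assuming the default host and port):

```
curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'
```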

@Christian_Dahlqvist, thanks for the reply. Sure, we are in a transition phase, but we need to support the legacy system for some time as well. Please find the cluster statistics below.

```
{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "elastic-search",
  "timestamp": 1667364285490,
  "status": "green",
  "indices": {
    "count": 40,
    "shards": {
      "total": 902,
      "primaries": 304,
      "replication": 1.9671052631578947,
      "index": {
        "shards": {
          "min": 10,
          "max": 27,
          "avg": 22.55
        },
        "primaries": {
          "min": 5,
          "max": 9,
          "avg": 7.6
        },
        "replication": {
          "min": 1.0,
          "max": 2.0,
          "avg": 1.95
        }
      }
    },
    "docs": {
      "count": 10457899,
      "deleted": 893903
    },
    "store": {
      "size_in_bytes": 1197975495169,
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 20345744,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 48421530,
      "total_count": 1531139351,
      "hit_count": 104916005,
      "miss_count": 1426223346,
      "cache_size": 50064,
      "cache_count": 6584989,
      "evictions": 6534925
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 8784,
      "memory_in_bytes": 6106210040,
      "terms_memory_in_bytes": 5848381731,
      "stored_fields_memory_in_bytes": 38513272,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 115063936,
      "points_memory_in_bytes": 4162781,
      "doc_values_memory_in_bytes": 100088320,
      "index_writer_memory_in_bytes": 168254915,
      "version_map_memory_in_bytes": 3228,
      "fixed_bit_set_memory_in_bytes": 891680,
      "max_unsafe_auto_id_timestamp": 1667300841619,
      "file_sizes": {

      }
    }
  },
  "nodes": {
    "count": {
      "total": 3,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 3
    },
    "versions": [
      "5.6.16"
    ],
    "os": {
      "available_processors": 96,
      "allocated_processors": 96,
      "names": [
        {
          "name": "Linux",
          "count": 3
        }
      ],
      "mem": {
        "total_in_bytes": 200153673728,
        "free_in_bytes": 1518215168,
        "used_in_bytes": 198635458560,
        "free_percent": 1,
        "used_percent": 99
      }
    },
    "process": {
      "cpu": {
        "percent": 9
      },
      "open_file_descriptors": {
        "min": 1392,
        "max": 1416,
        "avg": 1405
      }
    },
    "jvm": {
      "max_uptime_in_millis": 63879758,
      "versions": [
        {
          "version": "1.8.0_201",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.201-b08",
          "vm_vendor": "Oracle Corporation",
          "count": 3
        }
      ],
      "mem": {
        "heap_used_in_bytes": 61050643808,
        "heap_max_in_bytes": 96034947072
      },
      "threads": 845
    },
    "fs": {
      "total_in_bytes": 2497941516288,
      "free_in_bytes": 1249017913344,
      "available_in_bytes": 1248967581696,
      "spins": "true"
    },
    "plugins": [
      {
        "name": "analysis-kuromoji",
        "version": "5.6.16",
        "description": "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
        "has_native_controller": false
      },
      {
        "name": "elasticsearch-analysis-openkoreantext",
        "version": "1.0.0",
        "description": "Korean analysis plugin integrates open-korean-text module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.openkoreantext.AnalysisOpenKoreanTextPlugin",
        "has_native_controller": false
      },
      {
        "name": "analysis-smartcn",
        "version": "5.6.16",
        "description": "Smart Chinese Analysis plugin integrates Lucene Smart Chinese analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.smartcn.AnalysisSmartChinesePlugin",
        "has_native_controller": false
      },
      {
        "name": "analysis-stempel",
        "version": "5.6.16",
        "description": "The Stempel (Polish) Analysis plugin integrates Lucene stempel (polish) analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.stempel.AnalysisStempelPlugin",
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "netty4": 3
      },
      "http_types": {
        "netty4": 3
      }
    }
  }
}
```

As you are using a very old version, some information that is available in newer versions is missing from the stats, and I have not used this version in years. Overall I think the stats look OK and nothing much jumps out, although it looks like you may have quite large documents in your indices. As you apparently are suffering from heap pressure and long, frequent GC, it is possible that this is a contributing factor. Since we do not know anything about your use case, data or queries, it is hard to tell.
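If you want to watch how heap usage and GC develop per node while you investigate, something along these lines can help (assuming the default host and port; available columns may differ slightly between versions):

```
# Quick per-node view of heap usage and segment memory
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,segments.memory'

# More detail, including GC collection counts and times per node
curl -XGET 'http://localhost:9200/_nodes/stats/jvm?human&pretty'
```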

Thanks, @Christian_Dahlqvist, I really appreciate your help. Yes, we have quite large documents. If I configure 3 dedicated master nodes, will that prevent the cluster from breaking?

Impossible to tell as I do not know exactly what the issue is. It may help with stability, but if it is the data nodes that are suffering from GC issues it may not change things much.
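If you do try dedicated masters on 5.6, the node roles are set in elasticsearch.yml, roughly like this (a sketch of the role settings only, not a sizing recommendation):

```
# elasticsearch.yml on a dedicated master-eligible node (ES 5.x role settings)
node.master: true
node.data: false
node.ingest: false
```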

OK, is it recommended to increase the Elasticsearch search queue size? I am seeing rejections like this:

```
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@503a4548 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1ad7ca35[Running, pool size = 49, active threads = 49, queued tasks = 4149, completed tasks = 42552493]]
```

And what about increasing the ping timeout for the transport layer?

Increasing the queue size would likely just keep more requests in memory and put additional pressure on the heap and GC. Increasing the ping timeout does not address the underlying problem. You need to look at your queries and data to see what is driving heap usage.
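You can also keep an eye on how often searches are being rejected per node, for example:

```
# Search thread pool: active threads, queue depth and rejection counts per node
curl -XGET 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
```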

One thing you could try is to reduce the number of replicas from 2 to 1, as that would reduce the amount of data on each node. This could potentially be an easy way to reduce heap usage.
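The replica count can be changed dynamically, for example like this (using _all here; you may want to target specific indices instead):

```
# Reduce replicas from 2 to 1 across all indices
curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index": {
    "number_of_replicas": 1
  }
}'
```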

I would recommend you upgrade to the latest version of Elasticsearch, as a lot of improvements have been made since the version you are using came out. There are also not many people with recent experience of your version who would be able to help troubleshoot something like this.

Please upgrade. 5.x is super old and very much EOL, and there have been a tonne of improvements around cluster resilience and stability that will help your situation.

Thanks a lot, @Christian_Dahlqvist for your help and suggestions.
