Trouble with heap and GC; node faults

Hello again, here is a new topic about our problem.
We have an ES cluster with 3 nodes, each with 128 GB RAM and a 31 GB heap. Here is our jvm.options:

-Xms31g
-Xmx31g

## GC configuration
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=90
#-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
10-13:-XX:-UseConcMarkSweepGC
10-13:-XX:-UseCMSInitiatingOccupancyOnly
11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30
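
For reference, we assume the stock GC logging line from the packaged 7.x jvm.options is still in place (the path below is the RPM default; this is our assumption, we have not customized it):

## GC logging (assumed unchanged from the packaged defaults)
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m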

Sometimes we get warnings about GC overhead:

[2020-12-25T10:28:29,054][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][young][4544][185] duration [624ms], collections [1]/[1s], total [624ms]/[35.3s], memory [22.9gb]->[7.9gb]/[31gb], all_pools {[young] [15gb]->[48mb]/[0b]}{[old]
[7.7gb]->[7.7gb]/[31gb]}{[survivor] [122.4mb]->[160.5mb]/[0b]}
[2020-12-25T10:28:29,055][WARN ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4544] overhead, spent [624ms] collecting in the last [1s]
[2020-12-25T10:28:34,442][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4549] overhead, spent [325ms] collecting in the last [1.2s]
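
As far as we understand, these messages come from the JvmGcMonitorService overhead thresholds; we have not changed them, so the defaults should apply (the values below are our assumption from the docs):

# elasticsearch.yml - GC overhead monitor thresholds (assumed defaults, not set by us)
monitor.jvm.gc.overhead.warn: 50    # WARN when >= 50% of the collection interval is spent in GC
monitor.jvm.gc.overhead.info: 25
monitor.jvm.gc.overhead.debug: 10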

Around the same time we also get follower/leader checker errors. This is from h1-es01:

[2020-12-25T10:29:26,373][DEBUG][o.e.c.c.LeaderChecker    ] [h1-es01] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{ciFEpbFAQyyUlwd-Lv4Kxw}{h1-es02ip}{h1-es02ip:9300}{dimr}]
org.elasticsearch.transport.RemoteTransportException: [h1-es02][h1-es02ip:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}] has been removed from the cluster

And this is from h1-es02 at the same time:

[2020-12-25T10:29:24,384][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cl
uster.fault_detection.follower_check.retry_count]=3} failed too many times
org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][h1-es01ip:9300][internal:coordination/fault_detection/follower_check] request_id [329372] timed out after [10006ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-12-25T10:29:24,385][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cl
uster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2020-12-25T10:29:24,385][DEBUG][o.e.c.s.MasterService    ] [h1-es02] executing cluster state update for [node-left[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr} reason: followers check retry count exceeded]]

The heap graph collected by Zabbix for h1-es01 looks like this:


Region 1 is when we are not writing to the cluster and region 2 is when we are writing to it (the regions may not be divided perfectly in the picture).
Please help us understand what is going on. Maybe our cluster is just too weak for this workload?
The fault can occur on any node, usually 1-3 times per day.

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.
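
For example, from one of the nodes (adjust protocol, host and credentials to your setup):

curl -sk -u <user>:<password> "https://localhost:9200/_cat/nodes?v"
curl -sk -u <user>:<password> "https://localhost:9200/_cat/health?v"
curl -sk -u <user>:<password> "https://localhost:9200/_cat/indices?v" > indices.txt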

GET /

{
  "name" : "h1-es01",
  "cluster_name" : "h1",
  "cluster_uuid" : "s71YMgBeQhyRUyYVIzX2sg",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "oss",
    "build_type" : "rpm",
    "build_hash" : "083627f112ba94dffc1232e8b42b73492789ef91",
    "build_date" : "2020-09-01T21:22:21.964974Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

GET /_cat/nodes?v

ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
h1-es02ip           75          93   6    9.98   10.19     9.93 dimr      -      h1-es02
h1-es03ip           50          83   5   14.42   10.10     9.55 dimr      -      h1-es03
h1-es01ip           74          98   4    4.47    5.53     5.61 dimr      *      h1-es01

GET /_cat/health?v

epoch      timestamp cluster status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1608884974 08:29:34  h1      green           3         3   2160 1592    0    0        0             0                  -                100.0%

Health is green now, but at the time of a node fault it goes red (some of our indices don't have replicas).

GET /_cat/indices?v
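
To see which of our indices have no replicas, something like this cat request should work (columns and sort are our guess at the right parameters):

GET /_cat/indices?v&h=index,pri,rep,store.size&s=rep:asc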

Can you please verify that your heap size is set so that you are using compressed pointers? I believe this is printed on startup.

I see that you are using the OSS distribution. Are you using any third-party plugins that could affect heap usage? If so, it may be useful to disable them and see if that changes heap usage.
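
You can list the installed plugins with, for example:

GET /_cat/plugins?v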

1. Here is the output of the _nodes/stats/jvm?pretty command:
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "h1",
  "nodes" : {
    "qgmMV2UbT-ScN9uRr6YM8g" : {
      "timestamp" : 1608886968477,
      "name" : "h1-es02",
      "transport_address" : ":9300",
      "host" : "",
      "ip" : ":9300",
      "roles" : [
        "data",
        "ingest",
        "master",
        "remote_cluster_client"
      ],
      "jvm" : {
        "timestamp" : 1608886968159,
        "uptime_in_millis" : 5741814,
        "mem" : {
          "heap_used_in_bytes" : 14157604488,
          "heap_used_percent" : 42,
          "heap_committed_in_bytes" : 33285996544,
          "heap_max_in_bytes" : 33285996544,
          "non_heap_used_in_bytes" : 245243208,
          "non_heap_committed_in_bytes" : 259325952,
          "pools" : {
            "young" : {
              "used_in_bytes" : 1358954496,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 19335741440,
              "peak_max_in_bytes" : 0
            },
            "old" : {
              "used_in_bytes" : 12649817600,
              "max_in_bytes" : 33285996544,
              "peak_used_in_bytes" : 12777250304,
              "peak_max_in_bytes" : 33285996544
            },
            "survivor" : {
              "used_in_bytes" : 148832392,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 729808896,
              "peak_max_in_bytes" : 0
            }
          }
        },
        "threads" : {
          "count" : 423,
          "peak_count" : 530
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 259,
              "collection_time_in_millis" : 50184
            },
            "old" : {
              "collection_count" : 0,
              "collection_time_in_millis" : 0
            }
          }
        },
        "buffer_pools" : {
          "mapped" : {
            "count" : 24875,
            "used_in_bytes" : 1647377279884,
            "total_capacity_in_bytes" : 1647377279884
          },
          "direct" : {
            "count" : 240,
            "used_in_bytes" : 52044525,
            "total_capacity_in_bytes" : 52044524
          },
          "mapped - 'non-volatile memory'" : {
            "count" : 0,
            "used_in_bytes" : 0,
            "total_capacity_in_bytes" : 0
          }
        },
        "classes" : {
          "current_loaded_count" : 22755,
          "total_loaded_count" : 22755,
          "total_unloaded_count" : 0
        }
      }
    },
    "MT3BSgtaQBWux8BJDBSsHg" : {
      "timestamp" : 1608886968477,
      "name" : "h1-es01",
      "transport_address" : ":9300",
      "host" : "",
      "ip" : ":9300",
      "roles" : [
        "data",
        "ingest",
        "master",
        "remote_cluster_client"
      ],
      "jvm" : {
        "timestamp" : 1608886967908,
        "uptime_in_millis" : 8421864,
        "mem" : {
          "heap_used_in_bytes" : 9313690056,
          "heap_used_percent" : 27,
          "heap_committed_in_bytes" : 33285996544,
          "heap_max_in_bytes" : 33285996544,
          "non_heap_used_in_bytes" : 260485128,
          "non_heap_committed_in_bytes" : 275427328,
          "pools" : {
            "young" : {
              "used_in_bytes" : 192937984,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 19906166784,
              "peak_max_in_bytes" : 0
            },
            "old" : {
              "used_in_bytes" : 9026366968,
              "max_in_bytes" : 33285996544,
              "peak_used_in_bytes" : 9341971968,
              "peak_max_in_bytes" : 33285996544
            },
            "survivor" : {
              "used_in_bytes" : 94385104,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 769955072,
              "peak_max_in_bytes" : 0
            }
          }
        },
        "threads" : {
          "count" : 457,
          "peak_count" : 583
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 182,
              "collection_time_in_millis" : 38222
            },
            "old" : {
              "collection_count" : 0,
              "collection_time_in_millis" : 0
            }
          }
        },
        "buffer_pools" : {
          "mapped" : {
            "count" : 23958,
            "used_in_bytes" : 1584673449064,
            "total_capacity_in_bytes" : 1584673449064
          },
          "direct" : {
            "count" : 281,
            "used_in_bytes" : 51967222,
            "total_capacity_in_bytes" : 51967221
          },
          "mapped - 'non-volatile memory'" : {
            "count" : 0,
            "used_in_bytes" : 0,
            "total_capacity_in_bytes" : 0
          }
        },
        "classes" : {
          "current_loaded_count" : 23425,
          "total_loaded_count" : 23425,
          "total_unloaded_count" : 0
        }
      }
    },
    "Qshtg7-TQIyxeiccpkmlIA" : {
      "timestamp" : 1608886968478,
      "name" : "h1-es03",
      "transport_address" : ":9300",
      "host" : "",
      "ip" : ":9300",
      "roles" : [
        "data",
        "ingest",
        "master",
        "remote_cluster_client"
      ],
      "jvm" : {
        "timestamp" : 1608886967907,
        "uptime_in_millis" : 4152627,
        "mem" : {
          "heap_used_in_bytes" : 7274334160,
          "heap_used_percent" : 21,
          "heap_committed_in_bytes" : 33285996544,
          "heap_max_in_bytes" : 33285996544,
          "non_heap_used_in_bytes" : 230818648,
          "non_heap_committed_in_bytes" : 245190656,
          "pools" : {
            "young" : {
              "used_in_bytes" : 1006632960,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 19881000960,
              "peak_max_in_bytes" : 0
            },
            "old" : {
              "used_in_bytes" : 6132806656,
              "max_in_bytes" : 33285996544,
              "peak_used_in_bytes" : 6267024384,
              "peak_max_in_bytes" : 33285996544
            },
            "survivor" : {
              "used_in_bytes" : 134894544,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 932197776,
              "peak_max_in_bytes" : 0
            }
          }
        },
        "threads" : {
          "count" : 407,
          "peak_count" : 490
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 153,
              "collection_time_in_millis" : 28077
            },
            "old" : {
              "collection_count" : 0,
              "collection_time_in_millis" : 0
            }
          }
        },
        "buffer_pools" : {
          "mapped" : {
            "count" : 24849,
            "used_in_bytes" : 1605830094152,
            "total_capacity_in_bytes" : 1605830094152
          },
          "direct" : {
            "count" : 275,
            "used_in_bytes" : 51910705,
            "total_capacity_in_bytes" : 51910704
          },
          "mapped - 'non-volatile memory'" : {
            "count" : 0,
            "used_in_bytes" : 0,
            "total_capacity_in_bytes" : 0
          }
        },
        "classes" : {
          "current_loaded_count" : 21529,
          "total_loaded_count" : 21529,
          "total_unloaded_count" : 0
        }
      }
    }
  }
}

heap_committed_in_bytes equals 33285996544 bytes, which is exactly 31 GB (31 × 1024³ = 33285996544).
2. No, I don't think so.

The threshold for compressed pointers might be lower than your setting so you need to check the startup logs to be sure. If you are not using compressed pointers you might be wasting a lot of space.
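
If I remember correctly, the nodes info API also reports this, so something like the following should show it per node:

GET /_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.using_compressed_ordinary_object_pointers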

I believe the cluster stats API contains a list of all installed plugins. Can you provide the full output from this API? If you do not have any plugins installed I would recommend switching to the default distribution so you can secure your cluster.

We use compressed pointers, I think:

[2020-12-25T11:42:47,307][INFO ][o.e.e.NodeEnvironment    ] [h1-es01] heap size [31gb], compressed ordinary object pointers [true]

We can't change the ES distribution because we need authorization. These are our installed plugins:

 "plugins" : [
      {
        "name" : "opendistro_alerting",
        "version" : "1.11.0.1",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Amazon OpenDistro alerting plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.alerting.AlertingPlugin",
        "extended_plugins" : [
          "lang-painless"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_performance_analyzer",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Performance Analyzer Plugin",
        "classname" : "com.amazon.opendistro.elasticsearch.performanceanalyzer.PerformanceAnalyzerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-knn",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch KNN",
        "classname" : "com.amazon.opendistroforelasticsearch.knn.plugin.KNNPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_security",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Provide access control related features for Elasticsearch 7",
        "classname" : "com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-job-scheduler",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch job schduler plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.jobscheduler.JobSchedulerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_sql",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch SQL",
        "classname" : "com.amazon.opendistroforelasticsearch.sql.plugin.SQLPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-anomaly-detection",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Amazon opendistro elasticsearch anomaly detector plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.ad.AnomalyDetectorPlugin",
        "extended_plugins" : [
          "lang-painless",
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_index_management",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro Index Management Plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.indexmanagement.IndexManagementPlugin",
        "extended_plugins" : [
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      }
    ],

The default distribution includes security with the free Basic license tier. For us to be able to help here, you probably need to disable OpenDistro and show what difference that makes. That way we will know whether it is related to these plugins or not. If this is not possible I would recommend raising this in the OpenDistro forum.
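
For reference, a rough sketch of what that could look like (please verify against the Open Distro documentation for your version before trying it, especially on a production cluster):

# Option 1: disable the security plugin via elasticsearch.yml (Open Distro setting, assumed) and restart the node
opendistro_security.disabled: true

# Option 2: remove individual plugins on each node (example plugin name), then restart the node
/usr/share/elasticsearch/bin/elasticsearch-plugin remove opendistro-anomaly-detection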


I noticed that you did not paste the full output of the cluster stats API. Looking at the previous posts, there are a few additional things you should check, even though they may not be causing the issues you are seeing:

  • There seems to be a mismatch between your JVM configuration and the Java version reported as used by the plugins. If you had posted the full output it would have included the JVM version in use. Note that G1GC is not recommended when running on Java 8.
  • It looks like you have a lot of very small indices and shards. Note that this can be very inefficient.

Thank you for the answers. We did not notice that the plugins report a different Java version. We have both Java 1.8 and 11 installed, but we thought we were using 11. We need to check that.

Full /_cluster/stats output:

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "h1",
  "cluster_uuid" : "s71YMgBeQhyRUyYVIzX2sg",
  "timestamp" : 1608979067489,
  "status" : "green",
  "indices" : {
    "count" : 656,
    "shards" : {
      "total" : 2184,
      "primaries" : 1604,
      "replication" : 0.36159600997506236,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 3.3292682926829267
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 2.4451219512195124
        },
        "replication" : {
          "min" : 0.0,
          "max" : 2.0,
          "avg" : 0.6524390243902439
        }
      }
    },
    "docs" : {
      "count" : 7686066538,
      "deleted" : 9218
    },
    "store" : {
      "size_in_bytes" : 9238947906453,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 355440,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 631113633,
      "total_count" : 29915936,
      "hit_count" : 231681,
      "miss_count" : 29684255,
      "cache_size" : 20012,
      "cache_count" : 21126,
      "evictions" : 1114
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 36910,
      "memory_in_bytes" : 1036636656,
      "terms_memory_in_bytes" : 786597824,
      "stored_fields_memory_in_bytes" : 116719504,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 50803008,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 82516320,
      "index_writer_memory_in_bytes" : 23647792,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 45592,
      "max_unsafe_auto_id_timestamp" : 1608978137405,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 1705,
          "index_count" : 384
        },
        {
          "name" : "date",
          "count" : 3903,
          "index_count" : 654
        },
        {
          "name" : "double",
          "count" : 4,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 604,
          "index_count" : 296
        },
        {
          "name" : "geo_point",
          "count" : 1042,
          "index_count" : 139
        },
        {
          "name" : "integer",
          "count" : 737,
          "index_count" : 66
        },
        {
          "name" : "ip",
          "count" : 2290,
          "index_count" : 396
        },
        {
          "name" : "keyword",
          "count" : 113602,
          "index_count" : 653
        },
        {
          "name" : "long",
          "count" : 9372,
          "index_count" : 636
        },
        {
          "name" : "nested",
          "count" : 158,
          "index_count" : 90
        },
        {
          "name" : "object",
          "count" : 31211,
          "index_count" : 632
        },
        {
          "name" : "text",
          "count" : 20142,
          "index_count" : 645
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 3,
      "master" : 3,
      "remote_cluster_client" : 3
    },
    "versions" : [
      "7.9.1"
    ],
    "os" : {
      "available_processors" : 144,
      "allocated_processors" : 144,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 404655390720,
        "free_in_bytes" : 13895954432,
        "used_in_bytes" : 390759436288,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 3
      },
      "open_file_descriptors" : {
        "min" : 8723,
        "max" : 9095,
        "avg" : 8929
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 84442482,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 63599460904,
        "heap_max_in_bytes" : 99857989632
      },
      "threads" : 1542
    },
    "fs" : {
      "total_in_bytes" : 23627102601216,
      "free_in_bytes" : 14381292679168,
      "available_in_bytes" : 13180956553216
    },
    "plugins" : [
      {
        "name" : "opendistro_alerting",
        "version" : "1.11.0.1",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Amazon OpenDistro alerting plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.alerting.AlertingPlugin",
        "extended_plugins" : [
          "lang-painless"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_performance_analyzer",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Performance Analyzer Plugin",
        "classname" : "com.amazon.opendistro.elasticsearch.performanceanalyzer.PerformanceAnalyzerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-knn",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch KNN",
        "classname" : "com.amazon.opendistroforelasticsearch.knn.plugin.KNNPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_security",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Provide access control related features for Elasticsearch 7",
        "classname" : "com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-job-scheduler",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch job schduler plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.jobscheduler.JobSchedulerPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_sql",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro for Elasticsearch SQL",
        "classname" : "com.amazon.opendistroforelasticsearch.sql.plugin.SQLPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro-anomaly-detection",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Amazon opendistro elasticsearch anomaly detector plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.ad.AnomalyDetectorPlugin",
        "extended_plugins" : [
          "lang-painless",
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      },
      {
        "name" : "opendistro_index_management",
        "version" : "1.11.0.0",
        "elasticsearch_version" : "7.9.1",
        "java_version" : "1.8",
        "description" : "Open Distro Index Management Plugin",
        "classname" : "com.amazon.opendistroforelasticsearch.indexmanagement.IndexManagementPlugin",
        "extended_plugins" : [
          "opendistro-job-scheduler"
        ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 3
      },
      "http_types" : {
        "com.amazon.opendistroforelasticsearch.security.http.OpenDistroSecurityHttpServerTransport" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "oss",
        "type" : "rpm",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }
}

You've got a lot of shards for that many nodes and the amount of data that you have available.

Please check out Security for Elasticsearch is now free | Elastic Blog

Thank you for the answer. I have seen that the optimal number of shards per node can be calculated as heap_size * 20, so on our nodes it should be ~600 per node and ~1800 for the whole cluster. Is that right?

Roughly, yes.
But also aim for shards around the 50GB mark.

The 20 shards per GB of heap is a general recommendation for the maximum number of shards, not the optimum. I would say the optimal number of shards is generally significantly lower than that maximum.
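
As a rough illustration using the cluster stats you posted (total store of ~9.2 TB across 2184 shards):

31 GB heap * 20 shards/GB    ~ 620 shards per node, ~1860 for the 3-node cluster (an upper bound, not a target)
9.2 TB / 2184 shards         ~ 4.2 GB average per shard today
9.2 TB / 30-50 GB per shard  ~ 200-300 shards would hold the same data at the recommended shard size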


Now I have enabled some TRACE logging and I can see the following.
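It was enabled roughly like this, via the dynamic logger cluster settings (the exact logger name is our choice):

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.coordination": "TRACE"
  }
}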
es01 (master node) log:


[2021-01-11T11:36:44,623][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:44,625][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:45,274][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}}
[2021-01-11T11:36:45,541][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}}
[2021-01-11T11:36:45,616][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:45,617][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:45,625][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:45,626][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:46,543][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}}
[2021-01-11T11:36:46,617][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:46,619][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:46,627][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:46,630][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:47,546][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}}
[2021-01-11T11:36:47,620][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:47,622][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:47,630][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:36:47,633][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:36:48,220][TRACE][o.e.c.NodeConnectionsService] [h1-es01] connectDisconnectedTargets: {{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}=ConnectionTarget{discoveryNode={h1
-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, activityType=IDLE}, {h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}=ConnectionTarget{
discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}, activityType=IDLE}, {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}=
ConnectionTarget{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, activityType=IDLE}}
[2021-01-11T11:36:48,548][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}}

As you can see, sometimes instead of the leader check from es03 a "NodeConnectionsService" entry appears.
es03 logs:

[2021-01-11T11:36:51,640][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:52,642][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:53,644][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:54,646][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:55,274][DEBUG][o.e.c.c.LeaderChecker    ] [h1-es03] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{192.168.5
7.101}{<es01_ip>:9300}{dimr}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][<es01_ip>:9300][internal:coordination/fault_detection/leader_check] request_id [117011842] timed out after [10006ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
[2021-01-11T11:36:55,276][TRACE][o.e.c.c.LeaderChecker    ] [h1-es03] scheduling next check of {h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr} for [cluster.fault_detection.leader_check
.interval] = 1s
[2021-01-11T11:36:55,648][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:56,276][TRACE][o.e.c.c.LeaderChecker    ] [h1-es03] checking {h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr} with [cluster.fault_detection.leader_check.timeout] = 10s
[2021-01-11T11:36:56,650][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:57,267][TRACE][o.e.c.NodeConnectionsService] [h1-es03] connectDisconnectedTargets: {{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}=ConnectionTarget{discoveryNode={h1
-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, activityType=IDLE}, {h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}=ConnectionTarget{
discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}, activityType=IDLE}, {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}=
ConnectionTarget{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, activityType=IDLE}}
[2021-01-11T11:36:57,652][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:58,654][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:36:59,656][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:00,657][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:01,659][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path

And then es03 gets kicked out of the cluster.
es01 log:

[2021-01-11T11:38:19,167][TRACE][o.e.t.TaskCancellationService] [h1-es01] task [MT3BSgtaQBWux8BJDBSsHg:161719019] is cancelled
[2021-01-11T11:38:19,168][TRACE][o.e.t.TaskCancellationService] [h1-es01] Sending remove ban for tasks with the parent [MT3BSgtaQBWux8BJDBSsHg:161719019] to the node [{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{192.168.57.1
03}{<es03_ip>:9300}{dimr}]
[2021-01-11T11:38:19,738][TRACE][o.e.c.c.LeaderChecker    ] [h1-es01] handling LeaderCheckRequest{sender={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}}
[2021-01-11T11:38:19,782][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:38:19,783][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, failureCountSinceLastSuccess=0, [cl
uster.fault_detection.follower_check.retry_count]=3} check successful
[2021-01-11T11:38:19,892][TRACE][o.e.c.c.FollowersChecker ] [h1-es01] handleWakeUp: checking {h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} with FollowerCheckRequest{term=129, sender=
{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
[2021-01-11T11:38:19,895][DEBUG][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=3, [cl
uster.fault_detection.follower_check.retry_count]=3} failed too many times
org.elasticsearch.transport.RemoteTransportException: [h1-es03][<es03_ip>:9300][internal:coordination/fault_detection/follower_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: onFollowerCheckRequest: received check from faulty master, rejecting FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7
iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}}
        at org.elasticsearch.cluster.coordination.Coordinator.onFollowerCheckRequest(Coordinator.java:264) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$2.doRun(FollowersChecker.java:198) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
[2021-01-11T11:38:19,897][DEBUG][o.e.c.c.FollowersChecker ] [h1-es01] FollowerChecker{discoveryNode={h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, failureCountSinceLastSuccess=3, [cl
uster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2021-01-11T11:38:19,897][TRACE][o.e.c.s.MasterService    ] [h1-es01] will process [node-left[{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} reason: followers check retry count exceed
ed]]
[2021-01-11T11:38:19,898][DEBUG][o.e.c.s.MasterService    ] [h1-es01] executing cluster state update for [node-left[{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr} reason: followers ch
eck retry count exceeded]]

es03 log:

[2021-01-11T11:37:10,678][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:11,679][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:12,681][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:13,683][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:14,685][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:15,687][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:16,689][TRACE][o.e.c.c.FollowersChecker ] [h1-es03] responding to FollowerCheckRequest{term=129, sender={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}} on fast path
[2021-01-11T11:37:17,268][TRACE][o.e.c.NodeConnectionsService] [h1-es03] connectDisconnectedTargets: {{h1-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}=ConnectionTarget{discoveryNode={h1
-es03}{Qshtg7-TQIyxeiccpkmlIA}{irRkTH7XQt63qEIGc-SWjA}{<es03_ip>}{<es03_ip>:9300}{dimr}, activityType=IDLE}, {h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}=ConnectionTarget{
discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}, activityType=IDLE}, {h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}=
ConnectionTarget{discoveryNode={h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{1Oc8tIBFR428oBjV_uLjHw}{<es02_ip>}{<es02_ip>:9300}{dimr}, activityType=IDLE}}
[2021-01-11T11:37:17,282][DEBUG][o.e.c.c.LeaderChecker    ] [h1-es03] leader [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}] has failed 3 consecutive checks (limit [cluster.fault_det
ection.leader_check.retry_count] is 3); last failure was:
org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][<es01_ip>:9300][internal:coordination/fault_detection/leader_check] request_id [117012924] timed out after [10006ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
[2021-01-11T11:37:17,284][INFO ][o.e.c.c.Coordinator      ] [h1-es03] master node [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{<es01_ip>}{<es01_ip>:9300}{dimr}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.9.1.jar:7.9.1]
        at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityInterceptor$RestoringTransportResponseHandler.handleException(OpenDistroSecurityInterceptor.java:277) ~[?:?]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1172) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1073) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][<es01_ip>:9300][internal:coordination/fault_detection/leader_check] request_id [117012924] timed out after [10006ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) ~[elasticsearch-7.9.1.jar:7.9.1]
        ... 4 more
[2021-01-11T11:37:17,292][DEBUG][o.e.c.c.Coordinator      ] [h1-es03] onLeaderFailure: coordinator becoming CANDIDATE in term 129 (was FOLLOWER, lastKnownLeader was [Optional[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{5OJxyuZrR7iXpeL4OqyDiQ}{192.
168.57.101}{<es01_ip>:9300}{dimr}]])
[2021-01-11T11:37:17,295][TRACE][o.e.c.c.LeaderChecker    ] [h1-es03] setCurrentNodes: nodes:

[2021-01-11T11:37:17,295][TRACE][o.e.c.c.LeaderChecker    ] [h1-es03] already closed, doing nothing
[2021-01-11T11:37:17,296][TRACE][o.e.c.c.PreVoteCollector ] [h1-es03] updating with preVoteResponse=PreVoteResponse{currentTerm=129, lastAcceptedTerm=129, lastAcceptedVersion=396235}, leader=null
[2021-01-11T11:37:17,296][DEBUG][o.e.c.s.ClusterApplierService] [h1-es03] processing [becoming candidate: onLeaderFailure]: execute
[2021-01-11T11:37:17,407][DEBUG][o.e.c.s.ClusterApplierService] [h1-es03] cluster state updated, version [396235], source [becoming candidate: onLeaderFailure]

Anyone have any thoughts on the reasons for this behavior?
