Hello again, I'm opening a new topic about our problem.
We have an ES cluster with 3 nodes, each with 128 GB RAM and a 31 GB heap. Here is our jvm.options:
-Xms31g
-Xmx31g
## GC configuration
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=90
#-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
10-13:-XX:-UseConcMarkSweepGC
10-13:-XX:-UseCMSInitiatingOccupancyOnly
11-:-XX:+UseG1GC
11-:-XX:G1ReservePercent=25
11-:-XX:InitiatingHeapOccupancyPercent=30
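For completeness, the collector that is actually active can be checked like this (a minimal sketch, assuming the default HTTP port on localhost; <es-pid> is a placeholder for the Elasticsearch process id):
# Which garbage collectors each node reports via the node info API
curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors&pretty'
# Or ask the JVM on the node directly for the options it was started with
jcmd <es-pid> VM.command_line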
Sometimes we get warnings about GC overhead:
[2020-12-25T10:28:29,054][DEBUG][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][young][4544][185] duration [624ms], collections [1]/[1s], total [624ms]/[35.3s], memory [22.9gb]->[7.9gb]/[31gb], all_pools {[young] [15gb]->[48mb]/[0b]}{[old] [7.7gb]->[7.7gb]/[31gb]}{[survivor] [122.4mb]->[160.5mb]/[0b]}
[2020-12-25T10:28:29,055][WARN ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4544] overhead, spent [624ms] collecting in the last [1s]
[2020-12-25T10:28:34,442][INFO ][o.e.m.j.JvmGcMonitorService] [h1-es01] [gc][4549] overhead, spent [325ms] collecting in the last [1.2s]
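For context, the cumulative young/old collection counts and times per node can be pulled like this (a minimal sketch, assuming the default HTTP port on localhost; the filter_path is only there to trim the output):
# Cumulative GC collection counts and times per node (young vs old collectors)
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors&pretty'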
Around the same time we also get leader/follower check errors. This one is from h1-es01:
[2020-12-25T10:29:26,373][DEBUG][o.e.c.c.LeaderChecker ] [h1-es01] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader [{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{ciFEpbFAQyyUlwd-Lv4Kxw}{h1-es02ip}{h1-es02ip:9300}{dimr}]
org.elasticsearch.transport.RemoteTransportException: [h1-es02][h1-es02ip:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}] has been removed from the cluster
And this is from h1-es02 at the same time:
[2020-12-25T10:29:24,384][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cluster.fault_detection.follower_check.retry_count]=3} failed too many times
org.elasticsearch.transport.ReceiveTimeoutTransportException: [h1-es01][h1-es01ip:9300][internal:coordination/fault_detection/follower_check] request_id [329372] timed out after [10006ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-12-25T10:29:24,385][DEBUG][o.e.c.c.FollowersChecker ] [h1-es02] FollowerChecker{discoveryNode={h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr}, failureCountSinceLastSuccess=3, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2020-12-25T10:29:24,385][DEBUG][o.e.c.s.MasterService ] [h1-es02] executing cluster state update for [node-left[{h1-es01}{MT3BSgtaQBWux8BJDBSsHg}{zHhOoPfiTSeHEIwhyBgNpA}{h1-es01ip}{h1-es01ip:9300}{dimr} reason: followers check retry count exceeded]]
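The settings referenced in these log lines are the fault detection settings; in 7.9 they default to a 1s interval, a 10s timeout and 3 retries for both the leader and follower checks, which matches the [10006ms] timeout and the retry count of 3 above. They are static settings configured in elasticsearch.yml; the effective values can be double-checked like this (a sketch, assuming the default HTTP port on localhost):
# Effective fault detection settings (include_defaults also returns the built-in defaults)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep cluster.fault_detection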
The heap graph collected by Zabbix for h1-es01 looks like this:
In region 1 we are not writing to the cluster, and in region 2 we are writing to the cluster (the regions may not be divided perfectly in the picture).
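To line the Zabbix graph up with what the JVM itself reports, heap usage can also be sampled directly from the cluster, for example (a rough sketch, assuming the default HTTP port on localhost; the 60 second interval is arbitrary):
# Print heap usage of every node once a minute to compare with regions 1 and 2
while true; do
  date
  curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
  sleep 60
done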
Please help us understand what is going on. Or is our cluster simply too weak for this workload?
This can happen on any of the nodes, usually 1-3 times per day.