Elasticsearch cluster state change

Hello,
I'm running Elasticsearch 5.6, and when my cluster went yellow I got the logs below.

On the master:

[2020-05-15T14:50:57,908][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-elasticsearch_master_02] failed to execute on node [HlG_sniCQLO4KdpkV_EgjA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-elasticsearch_data_03][172.26.44.163:9300][cluster:monitor/nodes/stats[n]] request_id [42924423] timed out after [15001ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.9.jar:5.6.9]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
[2020-05-15T14:51:49,057][INFO ][o.e.c.r.a.AllocationService] [node-elasticsearch_master_02] Cluster health status changed from [GREEN] to [YELLOW] (reason: []).
[2020-05-15T14:51:49,060][INFO ][o.e.c.s.ClusterService   ] [node-elasticsearch_master_02] removed {{node-elasticsearch_data_03}{HlG_sniCQLO4KdpkV_EgjA}{BhBWwYheS8GrJmt0yvmAEQ}{172.26.44.163}{172.26.44.163:9300}{rack=rack-prod},}, reason: zen-disco-node-failed({node-elasticsearch_data_03}{HlG_sniCQLO4KdpkV_EgjA}{BhBWwYheS8GrJmt0yvmAEQ}{172.26.44.163}{172.26.44.163:9300}{rack=rack-prod}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)
[2020-05-15T14:51:51,434][WARN ][o.e.t.TransportService   ] [node-elasticsearch_master_02] Received response for a request that has timed out, sent [68527ms] ago, timed out [53526ms] ago, action [

On the data node:

[2020-05-15T14:49:37,709][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184550] overhead, spent [1.2m] collecting in the last [1.2m]
[2020-05-15T14:49:38,711][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184551] overhead, spent [718ms] collecting in the last [1s]
[2020-05-15T14:50:05,341][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184577] overhead, spent [1.3s] collecting in the last [1.5s]
[2020-05-15T14:50:14,464][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184586] overhead, spent [698ms] collecting in the last [1.1s]
[2020-05-15T14:51:51,485][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][old][184590][2267] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[1.1h], memory [21.8gb]->[21.2gb]/[21.9gb], all_pools {[young] [498.9mb]->[72.5mb]/[532.5mb]}{[survivor] [66.5mb]->[0b]/[66.5mb]}{[old] [21.2gb]->[21.1gb]/[21.3gb]}
[2020-05-15T14:51:51,485][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184590] overhead, spent [1.5m] collecting in the last [1.5m]
[2020-05-15T14:51:52,485][WARN ][o.e.m.j.JvmGcMonitorService] [node-elasticsearch_data_03] [gc][184591] overhead, spent [651ms] collecting in the last [1s]
[2020-05-15T14:51:52,601][WARN ][o.e.t.n.Netty4Transport  ] [node-elasticsearch_data_03] write and flush on the network layer failed (channel: [id: 0x6924cae3, L:/172.26.44.163:9300 ! R:/172.26.44.169:49866])
java.nio.channels.ClosedChannelException: null
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

Any idea?

Many thanks

The JVM is garbage collecting frequently and the collections are taking a long time to complete.
On top of that, communication between the nodes was interrupted and the cluster went yellow.

I am tempted to say the cluster is over-sharded, or that a bad query loaded a lot of field data / global ordinals.

Can you share the output of:

  • GET _tasks?detailed
  • GET _nodes/hot_threads
  • GET _cat/fielddata
  • GET _cat/health?v
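
If it is easier to run these from the command line, something along these lines should work (localhost:9200 is just an assumption, point it at any of your nodes):

# same requests via curl
curl -s 'http://localhost:9200/_tasks?detailed&pretty'   # in-flight tasks per node
curl -s 'http://localhost:9200/_nodes/hot_threads'       # busiest threads per node
curl -s 'http://localhost:9200/_cat/fielddata?v'         # field data memory per field
curl -s 'http://localhost:9200/_cat/health?v'            # overall cluster health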




GET _cat/fielddata

HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 mer.name.raw    40.9kb
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 mer.id.raw        968b
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 st.name.raw       17.6kb
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 pos.poiC.raw   6.2kb
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 appLabel.raw 12.4kb
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 phyPoiId.raw        0b
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 trType.raw    376b
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 ac.name.raw     9.7kb
HlG_sniCQLO4KdpkV_EgjA 172.26.44.163 172.26.44.163 node-elasticsearch_data_03 pos.id.raw            2.9mb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 trType.raw      0b
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 mer.name.raw    33.5kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 st.name.raw       19.3kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 phyPoiId.raw        0b
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 pos.poiC.raw   4.3kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 mer.id.raw         1kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 ac.name.raw    39.4kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 appLabel.raw 23.8kb
dLL8RK21Siuuy6cPsOEBsQ 172.26.44.162 172.26.44.162 node-elasticsearch_data_02 pos.id.raw              2mb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 pos.id.raw            1.9mb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 mer.name.raw    54.9kb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 pos.poiC.raw   3.8kb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 ac.name.raw    12.5kb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 mer.id.raw        944b
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 appLabel.raw 14.1kb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 st.name.raw       25.1kb
lAvP3H_wSwqCu5-ePjA_JA 172.26.44.164 172.26.44.164 node-elasticsearch_data_01 trType.raw    776b
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 mer.id.raw       1.1kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 st.name.raw       54.9kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 pos.poiC.raw   5.6kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 trType.raw    376b
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 ac.name.raw    10.4kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 mer.name.raw    23.7kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 appLabel.raw 13.5kb
LZCQXIyXT3COSZuPgaqIXQ 172.26.44.165 172.26.44.165 node-elasticsearch_data_06 pos.id.raw            2.8mb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 appLabel.raw 11.2kb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 ac.name.raw     9.2kb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 trType.raw    824b
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 pos.poiC.raw   6.3kb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 mer.id.raw        424b
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 st.name.raw         17kb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 pos.id.raw            2.5mb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 mer.name.raw    58.7kb
VaNqDlBtRyCUFEvdjuq5FQ 172.26.44.167 172.26.44.167 node-elasticsearch_data_04 phyPoiId.raw        0b
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 pos.id.raw              3mb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 mer.name.raw    52.2kb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 pos.poiC.raw   6.6kb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 ac.name.raw    12.2kb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 appLabel.raw 14.2kb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 trType.raw    376b
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 st.name.raw       50.2kb
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 phyPoiId.raw        0b
zGoA-maVQWafY2VczdEWTQ 172.26.44.166 172.26.44.166 node-elasticsearch_data_05 mer.id.raw       1.6kb

GET _cat/health?v

epoch      timestamp cluster             status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1589708513 09:41:53  cluster-prod-eu green           9         6   8902 4447    0    0        0             0                  -                100.0%

GET _tasks?detailed

{
  "nodes": {
    "8aneNwHESyK8_G5BHmVonA": {
      "name": "node-elasticsearch_master_02",
      "transport_address": "172.26.44.168:9300",
      "host": "172.26.44.168",
      "ip": "172.26.44.168:9300",
      "roles": [
        "master",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "8aneNwHESyK8_G5BHmVonA:68257647": {
          "node": "8aneNwHESyK8_G5BHmVonA",
          "id": 68257647,
          "type": "direct",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248757,
          "running_time_in_nanos": 3476783,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        },
        "8aneNwHESyK8_G5BHmVonA:68257646": {
          "node": "8aneNwHESyK8_G5BHmVonA",
          "id": 68257646,
          "type": "transport",
          "action": "cluster:monitor/tasks/lists",
          "description": "",
          "start_time_in_millis": 1589708248755,
          "running_time_in_nanos": 6152710,
          "cancellable": false
        }
      }
    },
    "dLL8RK21Siuuy6cPsOEBsQ": {
      "name": "node-elasticsearch_data_02",
      "transport_address": "172.26.44.162:9300",
      "host": "172.26.44.162",
      "ip": "172.26.44.162:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "dLL8RK21Siuuy6cPsOEBsQ:7142733": {
          "node": "dLL8RK21Siuuy6cPsOEBsQ",
          "id": 7142733,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248763,
          "running_time_in_nanos": 40498704,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        },
        "dLL8RK21Siuuy6cPsOEBsQ:7142734": {
          "node": "dLL8RK21Siuuy6cPsOEBsQ",
          "id": 7142734,
          "type": "netty",
          "action": "indices:data/write/bulk[s]",
          "status": {
            "phase": "waiting_on_primary"
          },
          "description": "requests[1], index[sim-status]",
          "start_time_in_millis": 1589708248796,
          "running_time_in_nanos": 7338124,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257650"
        },
        "dLL8RK21Siuuy6cPsOEBsQ:7142735": {
          "node": "dLL8RK21Siuuy6cPsOEBsQ",
          "id": 7142735,
          "type": "direct",
          "action": "indices:data/write/bulk[s][p]",
          "status": {
            "phase": "primary"
          },
          "description": "requests[1], index[sim-status]",
          "start_time_in_millis": 1589708248797,
          "running_time_in_nanos": 7230683,
          "cancellable": false,
          "parent_task_id": "dLL8RK21Siuuy6cPsOEBsQ:7142734"
        }
      }
    },
    "Y24oHA4cRruskaS4jsNM5w": {
      "name": "node-elasticsearch_master_03",
      "transport_address": "172.26.44.170:9300",
      "host": "172.26.44.170",
      "ip": "172.26.44.170:9300",
      "roles": [
        "master",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "Y24oHA4cRruskaS4jsNM5w:67675464": {
          "node": "Y24oHA4cRruskaS4jsNM5w",
          "id": 67675464,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248759,
          "running_time_in_nanos": 3135740,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "LZCQXIyXT3COSZuPgaqIXQ": {
      "name": "node-elasticsearch_data_06",
      "transport_address": "172.26.44.165:9300",
      "host": "172.26.44.165",
      "ip": "172.26.44.165:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "LZCQXIyXT3COSZuPgaqIXQ:13444318": {
          "node": "LZCQXIyXT3COSZuPgaqIXQ",
          "id": 13444318,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248767,
          "running_time_in_nanos": 27232698,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "lAvP3H_wSwqCu5-ePjA_JA": {
      "name": "node-elasticsearch_data_01",
      "transport_address": "172.26.44.164:9300",
      "host": "172.26.44.164",
      "ip": "172.26.44.164:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "lAvP3H_wSwqCu5-ePjA_JA:14690190": {
          "node": "lAvP3H_wSwqCu5-ePjA_JA",
          "id": 14690190,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248768,
          "running_time_in_nanos": 28755940,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "VaNqDlBtRyCUFEvdjuq5FQ": {
      "name": "node-elasticsearch_data_04",
      "transport_address": "172.26.44.167:9300",
      "host": "172.26.44.167",
      "ip": "172.26.44.167:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "VaNqDlBtRyCUFEvdjuq5FQ:17632205": {
          "node": "VaNqDlBtRyCUFEvdjuq5FQ",
          "id": 17632205,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248771,
          "running_time_in_nanos": 111489619,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "HlG_sniCQLO4KdpkV_EgjA": {
      "name": "node-elasticsearch_data_03",
      "transport_address": "172.26.44.163:9300",
      "host": "172.26.44.163",
      "ip": "172.26.44.163:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "HlG_sniCQLO4KdpkV_EgjA:11458810": {
          "node": "HlG_sniCQLO4KdpkV_EgjA",
          "id": 11458810,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248761,
          "running_time_in_nanos": 21424695,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "zGoA-maVQWafY2VczdEWTQ": {
      "name": "node-elasticsearch_data_05",
      "transport_address": "172.26.44.166:9300",
      "host": "172.26.44.166",
      "ip": "172.26.44.166:9300",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "zGoA-maVQWafY2VczdEWTQ:12180024": {
          "node": "zGoA-maVQWafY2VczdEWTQ",
          "id": 12180024,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248769,
          "running_time_in_nanos": 125469930,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    },
    "ITCOk7yJRHikijhFUXG1Iw": {
      "name": "node-elasticsearch_master_01",
      "transport_address": "172.26.44.169:9300",
      "host": "172.26.44.169",
      "ip": "172.26.44.169:9300",
      "roles": [
        "master",
        "ingest"
      ],
      "attributes": {
        "rack": "rack-prod"
      },
      "tasks": {
        "ITCOk7yJRHikijhFUXG1Iw:82529329": {
          "node": "ITCOk7yJRHikijhFUXG1Iw",
          "id": 82529329,
          "type": "netty",
          "action": "cluster:monitor/tasks/lists[n]",
          "description": "",
          "start_time_in_millis": 1589708248759,
          "running_time_in_nanos": 3453253,
          "cancellable": false,
          "parent_task_id": "8aneNwHESyK8_G5BHmVonA:68257646"
        }
      }
    }
  }
}

It looks like you are having problems with heap pressure and frequent, long GC pauses. What is the output of the GET /_nodes/stats/jvm,indices and GET /_cluster/stats APIs?
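
If you prefer to grab those from a shell, something like this should do (the localhost:9200 host and the jq filter are only an assumption, to illustrate which numbers to look at):

# dump the stats to files so they can be shared here
curl -s 'http://localhost:9200/_nodes/stats/jvm,indices?pretty' > nodes_stats.json
curl -s 'http://localhost:9200/_cluster/stats?pretty' > cluster_stats.json
# quick view of heap pressure per node (requires jq)
jq '.nodes[] | {name: .name, heap_used_percent: .jvm.mem.heap_used_percent}' nodes_stats.json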

From the information you've sent:

  • the cluster is not receiving a lot of indexing requests, so the instability does not seem to be triggered by indexing activity
  • field data usage is OK, so the heap consumption does not come from global ordinals
  • you have too many shards per node: this means a lot of JVM heap is being used just to track the segments/shards of the cluster
    • too many shards might also lead to too many open file descriptors: do you have the full stack trace following java.nio.channels.ClosedChannelException?
    • try to force merge old indices you know are read only (you will no longer write to them), or delete old data from the cluster (see the example requests below)
  • the cluster is green now, but garbage collection is making it unstable, as some nodes become so busy that they do not reply to pings for more than 15s

With GET _cat/shards?v&bytes=b we can see the number of shards, the average shard size, and the number of shards per index to confirm that.
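
For example (the index name logs-2019.01 is only a placeholder; replace it with one of your old, read-only indices, and adjust the host):

# list all shards with sizes in bytes
curl -s 'http://localhost:9200/_cat/shards?v&bytes=b'
# optionally mark an old index read-only, then merge it down to a single segment
curl -s -XPUT 'http://localhost:9200/logs-2019.01/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.write": true}'
curl -s -XPOST 'http://localhost:9200/logs-2019.01/_forcemerge?max_num_segments=1'

Force merging a large index is expensive, so it is best done off-peak and only on indices that no longer receive writes.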

  • I don't remember ever reindexing Elasticsearch. If reindexing is necessary, how can I do it?
  • In what kind of file should I look for the full java.nio.channels.ClosedChannelException stack trace?
  • How can I force merge old indices?
  • If I force merge old indices, will I still be able to read the data but not modify entries? Will the indices still be accessible?
  • How can I avoid/minimize GC in my cluster?

Please find the results of your request:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.