ES persistent outages

Hello. We recently upgraded to ES 7.17.9 (8.x is on the radar, but we have a lot of reindexing to do before then) and are having a lot of problems. We're using a fairly stock configuration on EC2.

Our setup consists of four c6id.8xlarge data nodes (32 cores, 64 GB RAM each) for a total of 128 cores / 256 GB RAM, all on local NVMe storage. We have three c6i.2xlarge master nodes (8 cores, 16 GB each), so 7 nodes total.

We have 47 indices with 604 primary shards and 1812 total shards (primaries + replicas). That's 1,089,185,044 docs across ~1.8 TB (yes, we are over-sharded). Many of our indices were created on old versions (the commands we use to check this are sketched after the list):

  • 21 on 6.8
  • 16 on 7.10
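Roughly, the check looks like this (the localhost endpoint is a placeholder for our actual hosts):

# Shard counts per index (primaries + replicas), largest stores first
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store,node&s=store:desc'

# Which ES version each index was created on (values come back as numeric version IDs)
curl -s 'http://localhost:9200/_all/_settings/index.version.created?flat_settings&pretty'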

We're seeing an issue where, when certain queries run against all indices, the logs show a lot of:

handling response [InboundMessage{Header{84861}{7.17.9}{15773635}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] on handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$4/[indices:data/read/search[phase/query]]:org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler@7e88bfb/org.elasticsearch.action.search.SearchExecutionStatsCollector/org.elasticsearch.action.search.AbstractSearchAsyncAction$1@7cf41599] took [26208ms] which is above the warn threshold of [5000ms]

and

sending transport message [Request{indices:admin/seq_no/retention_lease_background_sync[r]}{15773899}{false}{false}{false}] of size [535] on [Netty4TcpChannel{localAddress=/172.31.31.241:35100, remoteAddress=172.31.4.198/172.31.4.198:9300, profile=default}] took [23207ms] which is above the warn threshold of [5000ms] with success [true]

What follows is that one or more data nodes drop from the cluster, and they don't rejoin until hours later (sometimes 6 or 8 hours). Leading up to (and during) the outages, CPU, memory, network, and disk metrics are all low; there's no obvious resource that seems to be starved. The only way to remedy this quickly is a rolling restart. Another odd symptom we've seen is that the search and search_coordination thread pool queues climb steadily while this is happening.
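In case it's useful, this is roughly how we've been watching those queues while an incident develops (endpoint is again a placeholder):

# The queue column is what climbs steadily during an incident
curl -s 'http://localhost:9200/_cat/thread_pool/search,search_coordination?v&h=node_name,name,active,queue,rejected,completed'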

On 5.x we saw issues where large queries over all our indices would overload the cluster for a few minutes, but it would eventually come back. What's happening now makes the cluster extremely fragile in almost unpredictable ways: sometimes the same workloads cause no issues at all, other times they crash the cluster.

Here's our cluster config (everything not listed is at its default value):

{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2",
          "disk" : {
            "watermark" : {
              "low" : "25.0gb",
              "flood_stage" : "10.0gb",
              "high" : "22.0gb"
            }
          },
          "node_initial_primaries_recoveries" : "4",
          "awareness" : {
            "force" : {
              "zone" : {
                "values" : "us-west-2a,us-west-2b,us-west-2c"
              }
            }
          }
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "60mb"
      }
    }
  },
  "transient" : { }
}

I can post the full config if desired.
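If it helps, the full effective config (defaults included) can be pulled with something like:

# Everything, including defaults, flattened for easier diffing
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'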

Any insight on this would be greatly appreciated! Thank you!

What do you see in the logs of both the master node and the node that left the cluster when this happens?

Master node (es-master03):

Mar 7, 2023 @ 02:08:09.797 [o.e.c.r.a.AllocationService] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]]).
Mar 7, 2023 @ 02:08:09.871 [o.e.c.s.MasterService] node-left[{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]], term: 1485, version: 551291, delta: removed {{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d}}
Mar 7, 2023 @ 02:08:39.879 [o.e.c.s.ClusterApplierService] removed {{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d}}, term: 1485, version: 551291, reason: Publication{term=1485, version=551291}
Mar 7, 2023 @ 02:08:39.889 [o.e.c.r.DelayedAllocationService] scheduling reroute for delayed shards in [29.8s] (684 delayed shards)
Mar 7, 2023 @ 02:08:39.890 [o.e.c.c.C.CoordinatorPublication] after [30s] publication of cluster state version [551291] is still waiting for {es02}{BTN9G0zrRCmVNDqi-6PwOA}{D97hwpq6Q52E1F5QGqub9A}{172.31.40.211}{172.31.40.211:9300}{d}{xpack.installed=true, transform.node=false} [SENT_APPLY_COMMIT]
Mar 7, 2023 @ 02:08:43.028 [o.e.c.r.a.AllocationService] [data.2023-03][3] marking unavailable shards as stale: [7O0CEkT8TTieKBXOaPdqCA]
Mar 7, 2023 @ 02:08:43.085 [o.e.c.a.s.ShardStateAction] [data.2023-03][3] unexpected failure while failing shard [shard id [[data.2023-03][3]], allocation id [N4qaOAN6SSOjeCOBdqExoQ], primary term [13], message [failed to perform indices:data/write/bulk[s] on replica [data.2023-03][3], node[ONWF5whDQZeg4GAlC8tWWQ], [R], s[STARTED], a[id=N4qaOAN6SSOjeCOBdqExoQ]], failure [RemoteTransportException[[es01][172.31.31.241:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[data.2023-03][3] operation primary term [13] is too old (current [14])]; ], markAsStale [true]]
    org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [13] did not match current primary term [14],
    at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:459) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:840) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:407) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:243) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.access$100(MasterService.java:63) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:146) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:202) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.9.jar:7.17.9],
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
    at java.lang.Thread.run(Thread.java:1589) [?:?]
Mar 7, 2023 @ 02:09:00.080 [o.e.t.OutboundHandler] send message failed [channel: Netty4TcpChannel{localAddress=/172.31.3.160:32916, remoteAddress=172.31.30.196/172.31.30.196:9300, profile=default}]
	io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms,
	at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2114) [netty-handler-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [netty-transport-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at java.lang.Thread.run(Thread.java:1589) [?:?]
Mar 7, 2023 @ 02:09:00.081 [o.e.t.OutboundHandler] sending transport message [Request{internal:tcp/handshake}{10764171}{false}{false}{true}] of size [58] on [Netty4TcpChannel{localAddress=/172.31.3.160:32916, remoteAddress=172.31.30.196/172.31.30.196:9300, profile=default}] took [10003ms] which is above the warn threshold of [5000ms] with success [false]
Mar 7, 2023 @ 02:09:02.912 [o.e.c.InternalClusterInfoService] failed to retrieve stats for node [CAGyRNZkTl63DLjT-FlgeQ]: [es03][172.31.4.198:9300][cluster:monitor/nodes/stats[n]] request_id [10764152] timed out after [15005ms]
Mar 7, 2023 @ 02:09:19.824 [o.e.c.c.C.CoordinatorPublication] after [10s] publication of cluster state version [551294] is still waiting for {es02}{BTN9G0zrRCmVNDqi-6PwOA}{D97hwpq6Q52E1F5QGqub9A}{172.31.40.211}{172.31.40.211:9300}{d}{xpack.installed=true, transform.node=false} [SENT_APPLY_COMMIT], {es03}{CAGyRNZkTl63DLjT-FlgeQ}{6NWs_bEtQSedW2FHoEyIZw}{172.31.4.198}{172.31.4.198:9300}{d}{xpack.installed=true, transform.node=false} [SENT_PUBLISH_REQUEST]
Mar 7, 2023 @ 02:09:39.831 [o.e.c.a.s.ShardStateAction] [data.2023-02][1] unexpected failure while failing shard [shard id [[data.2023-02][1]], allocation id [jEvzhDmcRR2OxGpFa_4IrQ], primary term [42], message [failed to perform indices:data/write/bulk[s] on replica [data.2023-02][1], node[CAGyRNZkTl63DLjT-FlgeQ], [R], s[STARTED], a[id=jEvzhDmcRR2OxGpFa_4IrQ]], failure [RemoteTransportException[[es03][172.31.4.198:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[data.2023-02][1] operation primary term [42] is too old (current [43])]; ], markAsStale [true]]
	org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [42] did not match current primary term [43],
	at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:459) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:840) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:407) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:243) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.access$100(MasterService.java:63) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:146) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:202) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?]

Data node es02:

Mar 7, 2023 @ 02:07:31.123 [o.e.a.s.TransportSearchAction] [ONWF5whDQZeg4GAlC8tWWQ][cards.ts.2022-11][0]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cards.ts.2023-03, cards.ts.2023-02, cards.ts.2023-01, cards.ts.2022-09, cards.ts.2022-08, cards.ts.2022-07, cards.ts.2022-06, cards.ts.2022-05, cards.ts.2022-04, cards.ts.2022-03, cards.ts.2022-02, cards.ts.2022-01, cards.ts.2022-12, cards.ts.2022-11, cards.ts.2022-10, logs.2023-01-28, logs.2023-01-27, logs.2023-01-26, logs.2023-01-29, cards.ts.2020-12, logs.2023-01-31, logs.2023-01-30, cards.ts.2020-10, cards.ts.2020-11, logs.2023-02-03, logs.2023-02-02, logs.2023-02-01, cards.ts.2021-08, cards.ts.2021-09, cards.ts.2021-06, cards.ts.2021-07, cards.ts.2021-04, cards.ts.2021-05, cards.ts.2021-02, cards.ts.2021-03, cards.ts.2021-01, cards.ts.2021-11, cards.ts.2021-12, cards.ts.2021-10, cards.ts.2019-09, cards, cards.ts.2019-05, cards.ts.2019-06, cards.ts.2019-08, cards.ts.2019-04, cards.ts.2020-09, cards.ts.2020-07, cards.ts.2020-08, cards.ts.2020-05, cards.ts.2020-06, cards.ts.2020-03, cards.ts.2020-04, cards.ts.2019-12, cards.ts.2019-10, cards.ts.2019-11, cards.ts.2020-01, cards.ts.2020-02, logs.2023-02-18, logs.2023-02-17, logs.2023-02-16, logs.2023-02-15, logs.2023-02-19, logs.2023-02-21, logs.2023-02-20, logs.2023-02-24, logs.2023-02-23, logs.2023-02-22, logs.2023-02-07, logs.2023-02-06, logs.2023-02-05, logs.2023-02-04, logs.2023-02-09, logs.2023-02-08, logs.2023-02-10, logs.2023-02-14, logs.2023-02-13, logs.2023-02-12, logs.2023-02-11], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, enableFieldsEmulation=false, source={"from":0,"size":0,"query":{"bool":{"filter":[{"term":{"type":{"value":"purchases","boost":1.0}}},{"range":{"json_obj.global.date":{"from":null,"to":"now-0w/w","include_lower":true,"include_upper":false,"boost":1.0}}},{"exists":{"field":"json_obj.purchase_content.quantity","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.unit_price","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.product_name","boost":1.0}},{"range":{"json_obj.purchase_content.quantity":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.unit_price":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.basket_total":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"term":{"json_obj.purchase_content.gtin":{"value":"00028400372169","boost":1.0}}},{"term":{"json_obj.purchases.purchases.banner_id":{"value":"vons","boost":1.0}}},{"terms":{"entity_id":["164506"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"week":{"date_histogram":{"field":"json_obj.global.date","calendar_interval":"week","offset":-86400000,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"price":{"avg":{"field":"json_obj.purchase_content.unit_price"}},"sales":{"sum":{"field":"json_obj.purchase_content.total_sales"}}}}}}}] lastShard [false]
	org.elasticsearch.transport.RemoteTransportException: [es01][172.31.31.241:9300][indices:data/read/search[phase/query]], Caused by: org.elasticsearch.tasks.TaskCancelledException: task cancelled [http channel [Netty4HttpChannel{localAddress=/172.31.40.211:9200, remoteAddress=/172.31.9.146:6692}] closed],
	at org.elasticsearch.tasks.CancellableTask.getTaskCancelledException(CancellableTask.java:121) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.tasks.CancellableTask.ensureNotCancelled(CancellableTask.java:97) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.checkCancelled(SearchService.java:811) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:976) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:618) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?]
(previous entry repeated ×20)

Data node es04:

Mar 7, 2023 @ 02:08:43.072 [o.e.t.InboundHandler] handling response [InboundMessage{Header{11953}{7.17.9}{41701144}{false}{true}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] on handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$4/[indices:data/write/bulk[s][r]]:org.elasticsearch.action.ActionListenerResponseHandler@3abf545d/org.elasticsearch.action.support.RetryableAction$RetryingListener@6ce059d2] took [15205ms] which is above the warn threshold of [5000ms]
Mar 7, 2023 @ 02:08:43.073 [o.e.a.s.TransportSearchAction] [BTN9G0zrRCmVNDqi-6PwOA][cards.ts.2021-12][2]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cards.ts.2023-03, cards.ts.2023-02, cards.ts.2023-01, cards.ts.2022-09, cards.ts.2022-08, cards.ts.2022-07, cards.ts.2022-06, cards.ts.2022-05, cards.ts.2022-04, cards.ts.2022-03, cards.ts.2022-02, cards.ts.2022-01, cards.ts.2022-12, cards.ts.2022-11, cards.ts.2022-10, logs.2023-01-28, logs.2023-01-27, logs.2023-01-26, logs.2023-01-29, cards.ts.2020-12, logs.2023-01-31, logs.2023-01-30, cards.ts.2020-10, cards.ts.2020-11, logs.2023-02-03, logs.2023-02-02, logs.2023-02-01, cards.ts.2021-08, cards.ts.2021-09, cards.ts.2021-06, cards.ts.2021-07, cards.ts.2021-04, cards.ts.2021-05, cards.ts.2021-02, cards.ts.2021-03, cards.ts.2021-01, cards.ts.2021-11, cards.ts.2021-12, cards.ts.2021-10, cards.ts.2019-09, cards, cards.ts.2019-05, cards.ts.2019-06, cards.ts.2019-08, cards.ts.2019-04, cards.ts.2020-09, cards.ts.2020-07, cards.ts.2020-08, cards.ts.2020-05, cards.ts.2020-06, cards.ts.2020-03, cards.ts.2020-04, cards.ts.2019-12, cards.ts.2019-10, cards.ts.2019-11, cards.ts.2020-01, cards.ts.2020-02, logs.2023-02-18, logs.2023-02-17, logs.2023-02-16, logs.2023-02-15, logs.2023-02-19, logs.2023-02-21, logs.2023-02-20, logs.2023-02-24, logs.2023-02-23, logs.2023-02-22, logs.2023-02-07, logs.2023-02-06, logs.2023-02-05, logs.2023-02-04, logs.2023-02-09, logs.2023-02-08, logs.2023-02-10, logs.2023-02-14, logs.2023-02-13, logs.2023-02-12, logs.2023-02-11], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, enableFieldsEmulation=false, source={"from":0,"size":0,"query":{"bool":{"filter":[{"term":{"type":{"value":"purchases","boost":1.0}}},{"range":{"json_obj.global.date":{"from":null,"to":"now-0w/w","include_lower":true,"include_upper":false,"boost":1.0}}},{"exists":{"field":"json_obj.purchase_content.quantity","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.unit_price","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.product_name","boost":1.0}},{"range":{"json_obj.purchase_content.quantity":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.unit_price":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.basket_total":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"term":{"json_obj.purchase_content.gtin":{"value":"00028400516686","boost":1.0}}},{"term":{"json_obj.purchases.purchases.banner_id":{"value":"vons","boost":1.0}}},{"terms":{"entity_id":["164506"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"week":{"date_histogram":{"field":"json_obj.global.date","calendar_interval":"week","offset":-86400000,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"price":{"avg":{"field":"json_obj.purchase_content.unit_price"}},"sales":{"sum":{"field":"json_obj.purchase_content.total_sales"}}}}}}}] lastShard [true]
    org.elasticsearch.transport.SendRequestTransportException: [es02][172.31.40.211:9300][indices:data/read/search[phase/query]],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:881) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:843) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:350) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:505) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:64) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:331) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:651) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:883) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:843) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:350) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:505) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:64) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:331) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:651) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:883) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:376) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?], Caused by: org.elasticsearch.tasks.TaskCancelledException: parent task was cancelled [http channel [Netty4HttpChannel{localAddress=/172.31.30.196:9200, remoteAddress=/172.31.37.249:1154}] closed],
	at org.elasticsearch.tasks.TaskManager$CancellableTaskHolder.registerChildConnection(TaskManager.java:579) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.tasks.TaskManager.registerChildConnection(TaskManager.java:213) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:842) ~[elasticsearch-7.17.9.jar:7.17.9],
	... 41 more
(previous entry repeated ×4)
Mar 7, 2023 @ 02:08:43.074 [o.e.t.OutboundHandler] sending transport message [Request{indices:data/read/search[free_context]}{41696882}{false}{false}{false}] of size [435] on [Netty4TcpChannel{localAddress=/172.31.30.196:49620, remoteAddress=172.31.31.241/172.31.31.241:9300, profile=default}] took [66220ms] which is above the warn threshold of [5000ms] with success [true]
(previous entry repeated ×3)
Mar 7, 2023 @ 02:08:43.075 [o.e.a.b.TransportShardBulkAction] [[data.2023-03][3]] failed to perform indices:data/write/bulk[s] on replica [data.2023-03][3], node[ONWF5whDQZeg4GAlC8tWWQ], [R], s[STARTED], a[id=N4qaOAN6SSOjeCOBdqExoQ]

The 8.x reference manual has some troubleshooting guidance here which basically applies in 7.x too, although the details may differ a little. The very last paragraph is the important one in your case, I think: this looks like either network saturation or over-busy network threads.
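If you want to look at the network-thread side of that, a hot threads dump captured during an incident is usually the quickest check; something along these lines, adjusting the host and running it against each node:

# Look for transport_worker threads pinned near 100% in the output
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=9999&ignore_idle_threads=false' > hot_threads.txt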


Thanks, you pointed me in the right direction.

The issue wasn't the GC itself. Docker was shipping logs in blocking mode, our log receiver was overloaded, and that caused pauses whenever the ES nodes wrote to STDOUT, leading to timeouts in clustering, leading to pandemonium.

Ultimately, the fix was to set --log-opt mode=non-blocking --log-opt max-buffer-size=4m on the containers running the ES nodes.
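For anyone who hits the same thing, the change amounts to something like this (the image tag is illustrative and the rest of our flags/mounts are omitted):

# Non-blocking log delivery with a 4 MB ring buffer, so a slow log receiver
# can no longer stall writes to the container's STDOUT
docker run -d \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.9

The same options can also be set for all containers via log-opts in /etc/docker/daemon.json.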

Thanks for the help!


Great work. Yes, that would definitely explain it.
