ES persistent outages

Hello. We recently upgraded to ES 7.17.9 (8.x is on the radar, but we have a lot of reindexing to do before then) and are having a lot of problems. We're using a fairly stock configuration on EC2.

Our setup consists of four c6id.8xlarge data nodes (32 cores, 64 GB RAM each) for a total of 128 cores / 256 GB RAM, all on local NVMe storage. We have three c6i.2xlarge master nodes (8 cores, 16 GB each), so 7 nodes total.

We have 47 indices with 604 primary shards and 1812 total shards (primaries + replicas). That's 1,089,185,044 docs across ~1.8 TB (yes, we are over-sharded). Many of our indices were created on old versions (the commands we use to check this are sketched after the list):

  • 21 on 6.8
  • 16 on 7.10
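Roughly, the check looks like this (the localhost endpoint is a placeholder for our actual hosts):

# Shard counts per index (primaries + replicas), largest stores first
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store,node&s=store:desc'

# Which ES version each index was created on (values come back as numeric version IDs)
curl -s 'http://localhost:9200/_all/_settings/index.version.created?flat_settings&pretty'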

We're seeing an issue where, when certain queries run against all indices, the logs show a lot of:

handling response [InboundMessage{Header{84861}{7.17.9}{15773635}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] on handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$4/[indices:data/read/search[phase/query]]:org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler@7e88bfb/org.elasticsearch.action.search.SearchExecutionStatsCollector/org.elasticsearch.action.search.AbstractSearchAsyncAction$1@7cf41599] took [26208ms] which is above the warn threshold of [5000ms]

and

sending transport message [Request{indices:admin/seq_no/retention_lease_background_sync[r]}{15773899}{false}{false}{false}] of size [535] on [Netty4TcpChannel{localAddress=/172.31.31.241:35100, remoteAddress=172.31.4.198/172.31.4.198:9300, profile=default}] took [23207ms] which is above the warn threshold of [5000ms] with success [true]

What follows is that one or more data nodes drop from the cluster, and they don't rejoin until hours later (sometimes 6 or 8 hours). Leading up to (and during) the outages, CPU, memory, network, and disk metrics are all low; there's no obvious resource that seems to be starved. The only way to remedy this quickly is a rolling restart. Another odd symptom we've seen is that the search and search_coordination thread pool queues climb steadily while this is happening.
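In case it's useful, this is roughly how we've been watching those queues while an incident develops (endpoint is again a placeholder):

# The queue column is what climbs steadily during an incident
curl -s 'http://localhost:9200/_cat/thread_pool/search,search_coordination?v&h=node_name,name,active,queue,rejected,completed'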

On 5.x we saw issues where large queries over all our indices would overload the cluster for a few minutes, but it would eventually come back. What's happening now makes the cluster extremely fragile in almost unpredictable ways: sometimes the same workloads cause no issues at all, other times they crash the cluster.

Here's our cluster config (everything not listed is at its default value):

{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2",
          "disk" : {
            "watermark" : {
              "low" : "25.0gb",
              "flood_stage" : "10.0gb",
              "high" : "22.0gb"
            }
          },
          "node_initial_primaries_recoveries" : "4",
          "awareness" : {
            "force" : {
              "zone" : {
                "values" : "us-west-2a,us-west-2b,us-west-2c"
              }
            }
          }
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "60mb"
      }
    }
  },
  "transient" : { }
}

I can post the full config if desired.
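If it helps, the full effective config (defaults included) can be pulled with something like:

# Everything, including defaults, flattened for easier diffing
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'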

Any insight on this would be greatly appreciated! Thank you!

What do you see in the logs of both the master node and the node that left the cluster when this happens?

Master node (es-master03):

Mar 7, 2023 @ 02:08:09.797 [o.e.c.r.a.AllocationService] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]]).
Mar 7, 2023 @ 02:08:09.871 [o.e.c.s.MasterService] node-left[{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]], term: 1485, version: 551291, delta: removed {{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d}}
Mar 7, 2023 @ 02:08:39.879 [o.e.c.s.ClusterApplierService] removed {{es04}{eYtKS0IsQhS5n1ZSWgXfZg}{fsBpmwUlS-6yKj5L8sUU-w}{172.31.30.196}{172.31.30.196:9300}{d}}, term: 1485, version: 551291, reason: Publication{term=1485, version=551291}
Mar 7, 2023 @ 02:08:39.889 [o.e.c.r.DelayedAllocationService] scheduling reroute for delayed shards in [29.8s] (684 delayed shards)
Mar 7, 2023 @ 02:08:39.890 [o.e.c.c.C.CoordinatorPublication] after [30s] publication of cluster state version [551291] is still waiting for {es02}{BTN9G0zrRCmVNDqi-6PwOA}{D97hwpq6Q52E1F5QGqub9A}{172.31.40.211}{172.31.40.211:9300}{d}{xpack.installed=true, transform.node=false} [SENT_APPLY_COMMIT]
Mar 7, 2023 @ 02:08:43.028 [o.e.c.r.a.AllocationService] [data.2023-03][3] marking unavailable shards as stale: [7O0CEkT8TTieKBXOaPdqCA]
Mar 7, 2023 @ 02:08:43.085 [o.e.c.a.s.ShardStateAction] [data.2023-03][3] unexpected failure while failing shard [shard id [[data.2023-03][3]], allocation id [N4qaOAN6SSOjeCOBdqExoQ], primary term [13], message [failed to perform indices:data/write/bulk[s] on replica [data.2023-03][3], node[ONWF5whDQZeg4GAlC8tWWQ], [R], s[STARTED], a[id=N4qaOAN6SSOjeCOBdqExoQ]], failure [RemoteTransportException[[es01][172.31.31.241:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[data.2023-03][3] operation primary term [13] is too old (current [14])]; ], markAsStale [true]]
    org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [13] did not match current primary term [14],
    at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:459) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:840) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:407) ~[elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:243) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService.access$100(MasterService.java:63) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:146) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:202) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.9.jar:7.17.9],
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.9.jar:7.17.9],
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
    at java.lang.Thread.run(Thread.java:1589) [?:?]
Mar 7, 2023 @ 02:09:00.080 [o.e.t.OutboundHandler] send message failed [channel: Netty4TcpChannel{localAddress=/172.31.3.160:32916, remoteAddress=172.31.30.196/172.31.30.196:9300, profile=default}]
	io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms,
	at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2114) [netty-handler-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [netty-transport-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.66.Final.jar:4.1.66.Final],
	at java.lang.Thread.run(Thread.java:1589) [?:?]
Mar 7, 2023 @ 02:09:00.081 [o.e.t.OutboundHandler] sending transport message [Request{internal:tcp/handshake}{10764171}{false}{false}{true}] of size [58] on [Netty4TcpChannel{localAddress=/172.31.3.160:32916, remoteAddress=172.31.30.196/172.31.30.196:9300, profile=default}] took [10003ms] which is above the warn threshold of [5000ms] with success [false]
Mar 7, 2023 @ 02:09:02.912 [o.e.c.InternalClusterInfoService] failed to retrieve stats for node [CAGyRNZkTl63DLjT-FlgeQ]: [es03][172.31.4.198:9300][cluster:monitor/nodes/stats[n]] request_id [10764152] timed out after [15005ms]
Mar 7, 2023 @ 02:09:19.824 [o.e.c.c.C.CoordinatorPublication] after [10s] publication of cluster state version [551294] is still waiting for {es02}{BTN9G0zrRCmVNDqi-6PwOA}{D97hwpq6Q52E1F5QGqub9A}{172.31.40.211}{172.31.40.211:9300}{d}{xpack.installed=true, transform.node=false} [SENT_APPLY_COMMIT], {es03}{CAGyRNZkTl63DLjT-FlgeQ}{6NWs_bEtQSedW2FHoEyIZw}{172.31.4.198}{172.31.4.198:9300}{d}{xpack.installed=true, transform.node=false} [SENT_PUBLISH_REQUEST]
Mar 7, 2023 @ 02:09:39.831 [o.e.c.a.s.ShardStateAction] [data.2023-02][1] unexpected failure while failing shard [shard id [[data.2023-02][1]], allocation id [jEvzhDmcRR2OxGpFa_4IrQ], primary term [42], message [failed to perform indices:data/write/bulk[s] on replica [data.2023-02][1], node[CAGyRNZkTl63DLjT-FlgeQ], [R], s[STARTED], a[id=jEvzhDmcRR2OxGpFa_4IrQ]], failure [RemoteTransportException[[es03][172.31.4.198:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[data.2023-02][1] operation primary term [42] is too old (current [43])]; ], markAsStale [true]]
	org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [42] did not match current primary term [43],
	at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:459) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:840) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:407) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:243) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService.access$100(MasterService.java:63) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:146) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:202) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?]

Data node es02:

Mar 7, 2023 @ 02:07:31.123 [o.e.a.s.TransportSearchAction] [ONWF5whDQZeg4GAlC8tWWQ][cards.ts.2022-11][0]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cards.ts.2023-03, cards.ts.2023-02, cards.ts.2023-01, cards.ts.2022-09, cards.ts.2022-08, cards.ts.2022-07, cards.ts.2022-06, cards.ts.2022-05, cards.ts.2022-04, cards.ts.2022-03, cards.ts.2022-02, cards.ts.2022-01, cards.ts.2022-12, cards.ts.2022-11, cards.ts.2022-10, logs.2023-01-28, logs.2023-01-27, logs.2023-01-26, logs.2023-01-29, cards.ts.2020-12, logs.2023-01-31, logs.2023-01-30, cards.ts.2020-10, cards.ts.2020-11, logs.2023-02-03, logs.2023-02-02, logs.2023-02-01, cards.ts.2021-08, cards.ts.2021-09, cards.ts.2021-06, cards.ts.2021-07, cards.ts.2021-04, cards.ts.2021-05, cards.ts.2021-02, cards.ts.2021-03, cards.ts.2021-01, cards.ts.2021-11, cards.ts.2021-12, cards.ts.2021-10, cards.ts.2019-09, cards, cards.ts.2019-05, cards.ts.2019-06, cards.ts.2019-08, cards.ts.2019-04, cards.ts.2020-09, cards.ts.2020-07, cards.ts.2020-08, cards.ts.2020-05, cards.ts.2020-06, cards.ts.2020-03, cards.ts.2020-04, cards.ts.2019-12, cards.ts.2019-10, cards.ts.2019-11, cards.ts.2020-01, cards.ts.2020-02, logs.2023-02-18, logs.2023-02-17, logs.2023-02-16, logs.2023-02-15, logs.2023-02-19, logs.2023-02-21, logs.2023-02-20, logs.2023-02-24, logs.2023-02-23, logs.2023-02-22, logs.2023-02-07, logs.2023-02-06, logs.2023-02-05, logs.2023-02-04, logs.2023-02-09, logs.2023-02-08, logs.2023-02-10, logs.2023-02-14, logs.2023-02-13, logs.2023-02-12, logs.2023-02-11], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, enableFieldsEmulation=false, source={"from":0,"size":0,"query":{"bool":{"filter":[{"term":{"type":{"value":"purchases","boost":1.0}}},{"range":{"json_obj.global.date":{"from":null,"to":"now-0w/w","include_lower":true,"include_upper":false,"boost":1.0}}},{"exists":{"field":"json_obj.purchase_content.quantity","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.unit_price","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.product_name","boost":1.0}},{"range":{"json_obj.purchase_content.quantity":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.unit_price":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.basket_total":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"term":{"json_obj.purchase_content.gtin":{"value":"00028400372169","boost":1.0}}},{"term":{"json_obj.purchases.purchases.banner_id":{"value":"vons","boost":1.0}}},{"terms":{"entity_id":["164506"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"week":{"date_histogram":{"field":"json_obj.global.date","calendar_interval":"week","offset":-86400000,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"price":{"avg":{"field":"json_obj.purchase_content.unit_price"}},"sales":{"sum":{"field":"json_obj.purchase_content.total_sales"}}}}}}}] lastShard [false]
	org.elasticsearch.transport.RemoteTransportException: [es01][172.31.31.241:9300][indices:data/read/search[phase/query]], Caused by: org.elasticsearch.tasks.TaskCancelledException: task cancelled [http channel [Netty4HttpChannel{localAddress=/172.31.40.211:9200, remoteAddress=/172.31.9.146:6692}] closed],
	at org.elasticsearch.tasks.CancellableTask.getTaskCancelledException(CancellableTask.java:121) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.tasks.CancellableTask.ensureNotCancelled(CancellableTask.java:97) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.checkCancelled(SearchService.java:811) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:976) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:618) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?]
(previous entry repeated ×20)

Data node es04:

Mar 7, 2023 @ 02:08:43.072 [o.e.t.InboundHandler] handling response [InboundMessage{Header{11953}{7.17.9}{41701144}{false}{true}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] on handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$4/[indices:data/write/bulk[s][r]]:org.elasticsearch.action.ActionListenerResponseHandler@3abf545d/org.elasticsearch.action.support.RetryableAction$RetryingListener@6ce059d2] took [15205ms] which is above the warn threshold of [5000ms]
Mar 7, 2023 @ 02:08:43.073 [o.e.a.s.TransportSearchAction] [BTN9G0zrRCmVNDqi-6PwOA][cards.ts.2021-12][2]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cards.ts.2023-03, cards.ts.2023-02, cards.ts.2023-01, cards.ts.2022-09, cards.ts.2022-08, cards.ts.2022-07, cards.ts.2022-06, cards.ts.2022-05, cards.ts.2022-04, cards.ts.2022-03, cards.ts.2022-02, cards.ts.2022-01, cards.ts.2022-12, cards.ts.2022-11, cards.ts.2022-10, logs.2023-01-28, logs.2023-01-27, logs.2023-01-26, logs.2023-01-29, cards.ts.2020-12, logs.2023-01-31, logs.2023-01-30, cards.ts.2020-10, cards.ts.2020-11, logs.2023-02-03, logs.2023-02-02, logs.2023-02-01, cards.ts.2021-08, cards.ts.2021-09, cards.ts.2021-06, cards.ts.2021-07, cards.ts.2021-04, cards.ts.2021-05, cards.ts.2021-02, cards.ts.2021-03, cards.ts.2021-01, cards.ts.2021-11, cards.ts.2021-12, cards.ts.2021-10, cards.ts.2019-09, cards, cards.ts.2019-05, cards.ts.2019-06, cards.ts.2019-08, cards.ts.2019-04, cards.ts.2020-09, cards.ts.2020-07, cards.ts.2020-08, cards.ts.2020-05, cards.ts.2020-06, cards.ts.2020-03, cards.ts.2020-04, cards.ts.2019-12, cards.ts.2019-10, cards.ts.2019-11, cards.ts.2020-01, cards.ts.2020-02, logs.2023-02-18, logs.2023-02-17, logs.2023-02-16, logs.2023-02-15, logs.2023-02-19, logs.2023-02-21, logs.2023-02-20, logs.2023-02-24, logs.2023-02-23, logs.2023-02-22, logs.2023-02-07, logs.2023-02-06, logs.2023-02-05, logs.2023-02-04, logs.2023-02-09, logs.2023-02-08, logs.2023-02-10, logs.2023-02-14, logs.2023-02-13, logs.2023-02-12, logs.2023-02-11], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, enableFieldsEmulation=false, source={"from":0,"size":0,"query":{"bool":{"filter":[{"term":{"type":{"value":"purchases","boost":1.0}}},{"range":{"json_obj.global.date":{"from":null,"to":"now-0w/w","include_lower":true,"include_upper":false,"boost":1.0}}},{"exists":{"field":"json_obj.purchase_content.quantity","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.unit_price","boost":1.0}},{"exists":{"field":"json_obj.purchase_content.product_name","boost":1.0}},{"range":{"json_obj.purchase_content.quantity":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.unit_price":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"range":{"json_obj.purchase_content.basket_total":{"from":0,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}},{"term":{"json_obj.purchase_content.gtin":{"value":"00028400516686","boost":1.0}}},{"term":{"json_obj.purchases.purchases.banner_id":{"value":"vons","boost":1.0}}},{"terms":{"entity_id":["164506"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"week":{"date_histogram":{"field":"json_obj.global.date","calendar_interval":"week","offset":-86400000,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"price":{"avg":{"field":"json_obj.purchase_content.unit_price"}},"sales":{"sum":{"field":"json_obj.purchase_content.total_sales"}}}}}}}] lastShard [true]
    org.elasticsearch.transport.SendRequestTransportException: [es02][172.31.40.211:9300][indices:data/read/search[phase/query]],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:881) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:843) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:350) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:505) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:64) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:331) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:651) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:883) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:843) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:350) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:505) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:64) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:331) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:651) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:883) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:937) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:925) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:250) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:104) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:316) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:376) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9],
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?],
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?],
	at java.lang.Thread.run(Thread.java:1589) [?:?], Caused by: org.elasticsearch.tasks.TaskCancelledException: parent task was cancelled [http channel [Netty4HttpChannel{localAddress=/172.31.30.196:9200, remoteAddress=/172.31.37.249:1154}] closed],
	at org.elasticsearch.tasks.TaskManager$CancellableTaskHolder.registerChildConnection(TaskManager.java:579) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.tasks.TaskManager.registerChildConnection(TaskManager.java:213) ~[elasticsearch-7.17.9.jar:7.17.9],
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:842) ~[elasticsearch-7.17.9.jar:7.17.9],
	... 41 more
(previous entry repeated ×4)
Mar 7, 2023 @ 02:08:43.074 [o.e.t.OutboundHandler] sending transport message [Request{indices:data/read/search[free_context]}{41696882}{false}{false}{false}] of size [435] on [Netty4TcpChannel{localAddress=/172.31.30.196:49620, remoteAddress=172.31.31.241/172.31.31.241:9300, profile=default}] took [66220ms] which is above the warn threshold of [5000ms] with success [true]
(previous entry repeated ×3)
Mar 7, 2023 @ 02:08:43.075 [o.e.a.b.TransportShardBulkAction] [[data.2023-03][3]] failed to perform indices:data/write/bulk[s] on replica [data.2023-03][3], node[ONWF5whDQZeg4GAlC8tWWQ], [R], s[STARTED], a[id=N4qaOAN6SSOjeCOBdqExoQ]

The 8.x reference manual has some troubleshooting guidance here which basically applies in 7.x too, although the details may differ a little. The very last paragraph is the important one in your case, I think: this looks like either network saturation or over-busy network threads.
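If you want to look at the network-thread side of that, a hot threads dump captured during an incident is usually the quickest check; something along these lines, adjusting the host and running it against each node:

# Look for transport_worker threads pinned near 100% in the output
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=9999&ignore_idle_threads=false' > hot_threads.txt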


Thanks, you pointed me in the right direction.

The issue wasn't the GC itself. Docker was shipping logs in blocking mode, our log receiver was overloaded, and that caused pauses whenever the ES nodes wrote to STDOUT, leading to timeouts in clustering, leading to pandemonium.

Ultimately, the fix was to set --log-opt mode=non-blocking --log-opt max-buffer-size=4m on the containers running the ES nodes.
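For anyone who hits the same thing, the change amounts to something like this (the image tag is illustrative and the rest of our flags/mounts are omitted):

# Non-blocking log delivery with a 4 MB ring buffer, so a slow log receiver
# can no longer stall writes to the container's STDOUT
docker run -d \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.9

The same options can also be set for all containers via log-opts in /etc/docker/daemon.json.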

Thanks for the help!


Great work. Yes, that would definitely explain it.
