Shards frequently getting marked as stale, causing the cluster to go yellow

We use Elasticsearch 7.2. It's an 11-node cluster with around 2TB of data spread across 30 shards.
We use G1GC and have already applied the fix from https://github.com/elastic/elasticsearch/pull/46169

Lately we've been seeing the circuit breaker trip unnecessarily (a false-positive alert) even though JVM heap usage was only around 18GB. I think the exception below shows the affected shard being marked as stale (markAsStale=true), which I guess explains the cluster going yellow.

So my questions are: what do I do about the circuit breaker? Is there more G1GC configuration I need to add to jvm.options? Why is the circuit breaker reporting incorrect usage?

NOTE: as a hack, I kept hitting the cluster with the _search API hoping a GC would run, and the circuit breaker never tripped after that!

[2020-05-08T21:40:19,555][WARN ][o.e.c.r.a.AllocationService] [es7advcl02-01] failing shard [failed shard, shard [issues-index][12], node[6vaOs45ISuuLYCqPE68-Rg], [R], s[STARTED], a[id=uFhWt4zcQsCctmOJD9yXkg], message [failed to perform indices:data/write/bulk[s] on replica [issues-index][12], node[6vaOs45ISuuLYCqPE68-Rg], [R], s[STARTED], a[id=uFhWt4zcQsCctmOJD9yXkg]], failure [RemoteTransportException[[es7advcl02-16][10.0.7.159:9301][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30640872138/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30640869176/28.5gb], new bytes reserved: [2962/2.8kb]]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [es7advcl02-16][10.0.7.159:9301][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30640872138/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30640869176/28.5gb], new bytes reserved: [2962/2.8kb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-05-08T21:40:19,559][INFO ][o.e.c.r.a.AllocationService] [es7advcl02-01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[issues-index][12]] ...]).
[2020-05-08T21:40:19,582][WARN ][o.e.g.G.InternalReplicaShardAllocator] [es7advcl02-01] [issues-index][12]: failed to list shard for shard_store on node [6vaOs45ISuuLYCqPE68-Rg]
org.elasticsearch.action.FailedNodeException: Failed node [6vaOs45ISuuLYCqPE68-Rg]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:223) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:142) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:198) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.2.0.jar:7.2.0]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.35.Final.jar:4.1.35.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.35.Final.jar:4.1.35.Final]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [es7advcl02-16][10.0.7.159:9301][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30724755590/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30724755256/28.6gb], new bytes reserved: [334/334b]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]

The JVM is telling Elasticsearch that 28.5GB of the heap is used. Can you explain why you think the heap used was actually only 18GB?

We plot jvm.mem.heap_used_in_bytes from _nodes/stats. That showed the usage as only 18GB. Is it wrong to rely on it?
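
For reference, we collect that metric with a request along these lines (the exact filter_path here is illustrative, not our exact monitoring query):

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_in_bytes,nodes.*.jvm.mem.heap_max_in_bytes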

The stats also come from the JVM, but maybe there's a spike in heap usage that your stats monitoring is missing. The GC log is more reliable, I'd check that instead.
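
If GC logging isn't already enabled, the stock jvm.options shipped with 7.x turns it on for JDK 9+ with a line roughly like this (exact log path and rotation settings may differ in your install):

-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m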

Sure, will do that as well.

One thing I was able to correlate this issue with is spam. Lately we've been hit by spam that has caused some of our ES documents to grow as large as 5-10MB. Updates on these docs become very slow as they grow. We are already deploying a fix for this. ES does not impose a limit on doc size, right?

So I went through our GC logs and noticed a couple of to-space exhausted events. You were also right about our heap usage indeed going up to 30GB, hence the circuit breaking.

When the to-space exhaustion happened, we had around 1400 regions dedicated to humongous objects. Our heap region size is 8MB (for a 30GB heap).
As I mentioned, we have a couple of documents with string fields that have grown to 1-4MB in size (spam documents). We are working on truncating such large strings.
Our documents go up to 5MB, and any update on them takes more than 10s, so it's likely that such large strings also stay live for a while.

We are also working towards increasing the heap region size to 16MB, but I'm unsure whether the JVM will accept that (there won't be 2048 regions available with 16MB regions).
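
Concretely, the change we're considering is just adding this standard HotSpot flag to jvm.options (the 16m value is our guess, not something we've validated yet):

-XX:G1HeapRegionSize=16m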

@Faiz_Ahmed_Mushtak_H, would it be possible to share the GC log file for this? Feel free to share it privately if you prefer.

A few other users have reported similar issues; some have had success increasing G1ReservePercent slightly (to 30 or 35, for example). It might be more of a workaround than a solution.
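
For reference, that workaround is a single JVM flag in jvm.options, along the lines of the following (treat the value as a starting point rather than a recommendation):

-XX:G1ReservePercent=30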

The symptom I have seen a few times is that a large number of primarily humongous objects are allocated, and this causes the heap usage to climb above the circuit breaker limit.

AFAIR you should be able to raise the region size up to 32MB, but tweaking this could cause other GC-related issues. It might be worth trying out, though. I think you get 8MB regions with your current heap, and using 16MB is not too far from the default 2048 regions.

How do I share it with you privately? Your email ID? I was trying to PM you here, but it only lets me upload images, not files.

I'd be happy to know what the impact of increasing the heap region size is. I was planning to go up to just 16MB.

I PM'ed you an upload target to use.

Uploaded

This does show symptoms similar to prior incidents: a period of no GC activity leading to high heap usage (the highest was 30708M), with many humongous allocations in that period. I think to-space exhaustion is more of a symptom than a cause (but I could be wrong).

Can you elaborate on the activity against the node? We are eager to collect as much info as possible to guide us in getting to the bottom of this.

Also, if you carry out any of the experiments mentioned (region size and reserve pct), we would appreciate getting feedback on the results here.

Our cluster, as of now, is only being ingested into (indexing + updates).
At the time of the incident, we saw a lot of queueing in the ES write threadpool, and indexing latency on this node was badly affected.

As I mentioned, our docs grow beyond 4MB. We use scripting to append strings to a field, and these incoming strings are 1-2MB each. So as the JSON grows, so does the latency of the update API call.
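
To give an idea of the shape of these updates, the call looks roughly like this (the field and parameter names are illustrative, not our exact mapping):

POST /issues-index/_update/<doc-id>
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.messages.add(params.msg)",
    "params": {
      "msg": "<incoming 1-2MB string>"
    }
  }
}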

Lately we've also been seeing our merges take up to 10 minutes (not sure if related, but noting it down anyway).

@Faiz_Ahmed_Mushtak_H

Thanks for the use case details. A few follow-up questions:

  1. Does this mean that there is no search involved (except for the searches done internally by index + update)?
  2. About the very large docs, how many of those are you either indexing or updating per second? If you know how many you touched in the 17 seconds before the circuit breaker exception we could maybe either rule that part out or in?
  3. How many documents do you have per shard? Would you be willing to share the output of GET /_stats?level=shards?
  1. Not really. We are in the migration phase, so once we know the cluster is good for real, we'll switch our search traffic to the 7.x cluster. Currently we are using 1.7.
  2. Right now it's hard to trace this without timestamps in our GC log. But whenever such documents get indexed, they get updated, say, 10-20 times (let's say 1-2 such documents get created and there are lots of updates on them, so roughly ~40-60 requests in total that are huge). The problematic field in our case is an array. As I mentioned, the large message gets appended into this array via a Painless script. So you can imagine that huge message being created in the JVM around 10-20 times (not sure how much string deduplication is helping us).
  3. Sure. We have 30 shards spread across 8 nodes (8-core, 64GB RAM). We expanded the cluster to 11 nodes in between, but had no luck with the circuit breaker, which kind of rules out a capacity issue. (We kept 30 shards following the recommendation that a shard should stay well below 50GB; shards right now are about 32GB in size.)

The output was huge, so I used a filter_path; hope that is OK.
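
(The request was roughly this, quoting from memory: GET /_stats?level=shards&filter_path=**.docs)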

{
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 219940333,
        "deleted" : 23290696
      }
    },
    "total" : {
      "docs" : {
        "count" : 439880666,
        "deleted" : 45807613
      }
    }
  },
  "indices" : {
    "issues-index" : {
      "primaries" : {
        "docs" : {
          "count" : 219940333,
          "deleted" : 23290696
        }
      },
      "total" : {
        "docs" : {
          "count" : 439880666,
          "deleted" : 45807613
        }
      },
      "shards" : {
        "0" : [
          {
            "docs" : {
              "count" : 7330167,
              "deleted" : 834513
            }
          },
          {
            "docs" : {
              "count" : 7330167,
              "deleted" : 786231
            }
          }
        ],
        "1" : [
          {
            "docs" : {
              "count" : 7328121,
              "deleted" : 712490
            }
          },
          {
            "docs" : {
              "count" : 7328121,
              "deleted" : 641326
            }
          }
        ],
        "2" : [
          {
            "docs" : {
              "count" : 7330290,
              "deleted" : 596492
            }
          },
          {
            "docs" : {
              "count" : 7330290,
              "deleted" : 648496
            }
          }
        ],
        "3" : [
          {
            "docs" : {
              "count" : 7330738,
              "deleted" : 833540
            }
          },
          {
            "docs" : {
              "count" : 7330738,
              "deleted" : 755969
            }
          }
        ],
        "4" : [
          {
            "docs" : {
              "count" : 7330505,
              "deleted" : 911492
            }
          },
          {
            "docs" : {
              "count" : 7330505,
              "deleted" : 909230
            }
          }
        ],
        "5" : [
          {
            "docs" : {
              "count" : 7330210,
              "deleted" : 779679
            }
          },
          {
            "docs" : {
              "count" : 7330210,
              "deleted" : 759257
            }
          }
        ],
        "6" : [
          {
            "docs" : {
              "count" : 7328490,
              "deleted" : 843540
            }
          },
          {
            "docs" : {
              "count" : 7328490,
              "deleted" : 869217
            }
          }
        ],
        "7" : [
          {
            "docs" : {
              "count" : 7332638,
              "deleted" : 684192
            }
          },
          {
            "docs" : {
              "count" : 7332638,
              "deleted" : 657901
            }
          }
        ],
        "8" : [
          {
            "docs" : {
              "count" : 7329587,
              "deleted" : 804609
            }
          },
          {
            "docs" : {
              "count" : 7329587,
              "deleted" : 773543
            }
          }
        ],
        "9" : [
          {
            "docs" : {
              "count" : 7333567,
              "deleted" : 928882
            }
          },
          {
            "docs" : {
              "count" : 7333567,
              "deleted" : 841566
            }
          }
        ],
        "10" : [
          {
            "docs" : {
              "count" : 7324284,
              "deleted" : 660736
            }
          },
          {
            "docs" : {
              "count" : 7324284,
              "deleted" : 661937
            }
          }
        ],
        "11" : [
          {
            "docs" : {
              "count" : 7333462,
              "deleted" : 545285
            }
          },
          {
            "docs" : {
              "count" : 7333462,
              "deleted" : 754873
            }
          }
        ],
        "12" : [
          {
            "docs" : {
              "count" : 7333608,
              "deleted" : 688299
            }
          },
          {
            "docs" : {
              "count" : 7333608,
              "deleted" : 644363
            }
          }
        ],
        "13" : [
          {
            "docs" : {
              "count" : 7334405,
              "deleted" : 659105
            }
          },
          {
            "docs" : {
              "count" : 7334405,
              "deleted" : 573698
            }
          }
        ],
        "14" : [
          {
            "docs" : {
              "count" : 7330523,
              "deleted" : 834984
            }
          },
          {
            "docs" : {
              "count" : 7330523,
              "deleted" : 839189
            }
          }
        ],
        "15" : [
          {
            "docs" : {
              "count" : 7325347,
              "deleted" : 728773
            }
          },
          {
            "docs" : {
              "count" : 7325347,
              "deleted" : 682688
            }
          }
        ],
        "16" : [
          {
            "docs" : {
              "count" : 7326812,
              "deleted" : 802271
            }
          },
          {
            "docs" : {
              "count" : 7326812,
              "deleted" : 792629
            }
          }
        ],
        "17" : [
          {
            "docs" : {
              "count" : 7331249,
              "deleted" : 733518
            }
          },
          {
            "docs" : {
              "count" : 7331249,
              "deleted" : 748165
            }
          }
        ],
        "18" : [
          {
            "docs" : {
              "count" : 7336129,
              "deleted" : 760912
            }
          },
          {
            "docs" : {
              "count" : 7336129,
              "deleted" : 822600
            }
          }
        ],
        "19" : [
          {
            "docs" : {
              "count" : 7334269,
              "deleted" : 910058
            }
          },
          {
            "docs" : {
              "count" : 7334269,
              "deleted" : 715469
            }
          }
        ],
        "20" : [
          {
            "docs" : {
              "count" : 7332589,
              "deleted" : 725276
            }
          },
          {
            "docs" : {
              "count" : 7332589,
              "deleted" : 793994
            }
          }
        ],
        "21" : [
          {
            "docs" : {
              "count" : 7331603,
              "deleted" : 702731
            }
          },
          {
            "docs" : {
              "count" : 7331603,
              "deleted" : 671716
            }
          }
        ],
        "22" : [
          {
            "docs" : {
              "count" : 7333134,
              "deleted" : 848535
            }
          },
          {
            "docs" : {
              "count" : 7333134,
              "deleted" : 881696
            }
          }
        ],
        "23" : [
          {
            "docs" : {
              "count" : 7335950,
              "deleted" : 930424
            }
          },
          {
            "docs" : {
              "count" : 7335950,
              "deleted" : 956261
            }
          }
        ],
        "24" : [
          {
            "docs" : {
              "count" : 7332682,
              "deleted" : 751757
            }
          },
          {
            "docs" : {
              "count" : 7332682,
              "deleted" : 771193
            }
          }
        ],
        "25" : [
          {
            "docs" : {
              "count" : 7333159,
              "deleted" : 840811
            }
          },
          {
            "docs" : {
              "count" : 7333159,
              "deleted" : 737667
            }
          }
        ],
        "26" : [
          {
            "docs" : {
              "count" : 7333133,
              "deleted" : 651475
            }
          },
          {
            "docs" : {
              "count" : 7333133,
              "deleted" : 634072
            }
          }
        ],
        "27" : [
          {
            "docs" : {
              "count" : 7329817,
              "deleted" : 850448
            }
          },
          {
            "docs" : {
              "count" : 7329817,
              "deleted" : 843076
            }
          }
        ],
        "28" : [
          {
            "docs" : {
              "count" : 7332782,
              "deleted" : 845423
            }
          },
          {
            "docs" : {
              "count" : 7332782,
              "deleted" : 832494
            }
          }
        ],
        "29" : [
          {
            "docs" : {
              "count" : 7331083,
              "deleted" : 691716
            }
          },
          {
            "docs" : {
              "count" : 7331083,
              "deleted" : 715131
            }
          }
        ]
      }
    }
  }
}

FYI, one more data point:

Recently we indexed all of our data into this cluster. The cluster was able to sustain around 200k docs ingested per minute without showing any signs of strain. It's only when we brought in production traffic that we started facing this issue (and our production traffic does doc updates at a rate of 8k per minute, which is peanuts compared to 200k!).

As far as I recall, the behaviour of updates has changed since Elasticsearch 1.7, and updating the same document frequently can lead to a lot of small refreshes (which are expensive) and small segments as a result. If you happen to have a few hot documents receiving lots of updates, it seems like you might be suffering from this.

I didn't really see a spike in index refreshes (but I wouldn't rule it out). My guess is that the heap is getting choked for some reason. Typically I'd expect the GC to follow a sawtooth pattern, but it is quite chaotic in my case.

In a day or two we are going to fix a couple of things:

  1. dedupe the array (to reduce JSON size; we see that the messages are repeated in our case)
  2. truncate messages to 10k characters (btw, Lucene ignores characters beyond 8192 during tokenization, right?)

Let's see how it behaves then. Happy to keep the discussion going nevertheless.

Thanks for the stats. The doc counts are not big enough for the underlying memory structures to turn into humongous objects with an 8MB region size.

The update code path does seem like it could provoke humongous allocations if the resulting doc is larger than half the region size (4MB). String dedupe will not help here, since the data is not a String at that point.

You might be able to avoid that by doing the update outside Elasticsearch instead (using optimistic concurrency control to safely index the resulting document). But if you can reduce the size by deduping your array, I would think that is the first thing to try out.
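
To sketch the idea (the field name, document ID, and the seq_no/primary_term values below are placeholders): read the current document, apply the append and dedupe on the client side, then index the full result conditionally so a concurrent writer causes a retry instead of a lost update.

GET /issues-index/_doc/<doc-id>

PUT /issues-index/_doc/<doc-id>?if_seq_no=362&if_primary_term=2
{
  "messages": ["existing deduped messages", "newly appended message"]
}

If another update sneaks in between the read and the write, the PUT fails with a version conflict and you re-read and retry.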