Elasticsearch keeps crashing: shards failing and "Data too large"

When I restart Elasticsearch, the cluster goes green for a little while, then it goes back to red and Kibana will no longer come up. Instead I see this error: "{"message":"all shards failed: [search_phase_execution_exception] all shards failed","statusCode":503,"error":"Service Unavailable"}". I'm not sure why this is happening; it keeps crashing over and over, and restarts don't last long.

I see "Data too large" and shards failing, but how do I fix this issue?

[2020-04-22T15:53:54,901][DEBUG][o.e.a.s.TransportSearchAction] [atl-cla-deves01] All shards failed for phase: [query]
[2020-04-22T15:53:54,902][WARN ][r.suppressed             ] [atl-cla-deves01] path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:305) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:264) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$1(InitialSearchPhase.java:251) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:172) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-04-22T15:53:56,175][WARN ][r.suppressed             ] [atl-cla-deves01] path: /.kibana/_doc/space%3Adefault, params: {index=.kibana, id=space:default}
org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana][_doc][space:default]: routing [null]]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:228) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:205) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:103) [elasticsearch-7.3.0.jar:7.3.0]
[2020-04-22T15:54:17,574][WARN ][o.e.c.r.a.AllocationService] [atl-cla-deves01] [.monitoring-kibana-7-2020.04.22][0] marking unavailable shards as stale: [VOA34f_1Tr6A4evPYstGFQ]
[2020-04-22T15:54:17,574][WARN ][o.e.c.r.a.AllocationService] [atl-cla-deves01] [.monitoring-logstash-7-2020.04.22][0] marking unavailable shards as stale: [vmkB0OVWTtCM29_5fNeCrg]
[2020-04-22T15:54:22,056][INFO ][o.e.c.r.a.AllocationService] [atl-cla-deves01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-7-2020.04.22][0]] ...]).
[2020-04-22T15:54:22,539][WARN ][o.e.x.m.e.l.LocalExporter] [atl-cla-deves01] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: RemoteTransportException[[met-cla-deves03][10.188.0.223:9300][indices:data/write/bulk[s]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [8376777120/7.8gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8376769800/7.8gb], new bytes reserved: [7320/7.1kb], usages [request=0/0b, fielddata=7020/6.8kb, in_flight_requests=7320/7.1kb, accounting=46573684/44.4mb]];
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:473) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:468) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:822) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:780) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.0.jar:7.3.0]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]

Caused by: org.elasticsearch.transport.RemoteTransportException: [met-cla-deves03][10.188.0.223:9300][indices:data/write/bulk[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [8376777120/7.8gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8376769800/7.8gb], new bytes reserved: [7320/7.1kb], usages [request=0/0b, fielddata=7020/6.8kb, in_flight_requests=7320/7.1kb, accounting=46573684/44.4mb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-04-22T17:23:40,549][INFO ][o.e.c.m.MetaDataCreateIndexService] [atl-cla-deves01] [filebeat-7.3.2-2020.04.22-000008] creating index, cause [rollover_index], templates [filebeat-7.3.2], shards [1]/[1], mappings [_doc]
[2020-04-22T17:23:40,928][INFO ][o.e.c.r.a.AllocationService] [atl-cla-deves01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[filebeat-7.3.2-2020.04.22-000008][0]] ...]).
[2020-04-22T17:50:34,754][WARN ][o.e.m.j.JvmGcMonitorService] [atl-cla-deves01] [gc][7012] overhead, spent [682ms] collecting in the last [1s]
[2020-04-22T17:51:44,119][INFO ][o.e.x.m.p.NativeController] [atl-cla-deves01] Native controller process has stopped - no new native processes can be started

How many nodes, shards, indices do you have?
What size heap on your nodes?

We have just recently added 3 more nodes to make it six, hoping that would help.

The circuit breaker is being triggered and is blocking a few operations in the logs you've shared.

CircuitBreakingException[[parent] Data too large,
data for [<transport_request>] would be [8376777120/7.8gb],
which is larger than the limit of [8127315968/7.5gb],
real usage: [8376769800/7.8gb],
new bytes reserved: [7320/7.1kb],
usages [request=0/0b,
fielddata=7020/6.8kb,
in_flight_requests=7320/7.1kb,
accounting=46573684/44.4mb]];
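The arithmetic behind that rejection is simple: the heap's real usage plus the bytes the request wants to reserve exceeds the parent breaker limit. A minimal sketch of the decision, using the numbers from the exception above (this is a simplification, not the actual Elasticsearch code):

```python
def would_trip(real_usage_bytes: int, new_bytes: int, limit_bytes: int) -> bool:
    """Return True if reserving new_bytes on top of real_usage_bytes
    would exceed the parent circuit breaker limit."""
    return real_usage_bytes + new_bytes > limit_bytes

# Values taken verbatim from the CircuitBreakingException:
real_usage = 8_376_769_800   # "real usage: [8376769800/7.8gb]"
new_bytes = 7_320            # "new bytes reserved: [7320/7.1kb]"
limit = 8_127_315_968        # "limit of [8127315968/7.5gb]"

# The heap is already over the limit before the 7.1kb request arrives,
# so even this tiny transport request is rejected.
print(would_trip(real_usage, new_bytes, limit))  # True
```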

The following warning highlights that JVM heap usage is high and garbage collection is being triggered:

[2020-04-22T17:50:34,754][WARN ][o.e.m.j.JvmGcMonitorService] [atl-cla-deves01] [gc][7012] overhead, spent [682ms] collecting in the last [1s]

Would you please share:

  • Which garbage collector you're using
  • The content of your jvm.options file
  • The output of GET _cat/nodes?v&h=version,jdk,heap.*,ram.*,name,fielddata.*,segments.*
  • The output of GET _cat/health?v

If you're using G1GC, this might be a known problem that was identified and fixed in 7.4.1+ (see PR).
If so, the workaround is also applicable on 7.4.0.
You have to edit the jvm.options file to match those settings:

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms12g
-Xmx12g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## optimizations

# disable calls to System#gc
-XX:+DisableExplicitGC

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# force the server VM (remove on 32-bit client JVMs)
-server

# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true

-Dindex.shard.check_on_startup=true

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
#-XX:HeapDumpPath=${heap.dump.path}

## GC logging

#-XX:+PrintGCDetails
#-XX:+PrintGCTimeStamps
#-XX:+PrintGCDateStamps
#-XX:+PrintClassHistogram
#-XX:+PrintTenuringDistribution
#-XX:+PrintGCApplicationStoppedTime

# log GC status to a file with time stamps
# ensure the directory exists
#-Xloggc:${loggc}

# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true


Please format the file using Markdown (use ``` to wrap the text).
It seems you're not using G1GC.

The commands I've shared should be executed on Kibana Dev Tools or via curl.

version jdk    heap.current heap.percent heap.max ram.current ram.percent ram.max name            fielddata.memory_size fielddata.evictions segments.count segments.memory segments.index_writer_memory segments.version_map_memory segments.fixed_bitset_memory
7.3.0   12.0.1      552.7mb            4   11.9gb      15.3gb          99  15.4gb atl-cla-deves02                13.3kb                   0            739          39.1mb                        3.9mb                          0b                      659.4kb
7.3.0   12.0.1      400.7mb            3   11.9gb      15.3gb          99  15.4gb met-cla-deves01                30.5kb                   0            775            34mb                        1.9mb                          0b                        1.3mb
7.3.0   12.0.1      537.5mb            4   11.9gb      15.2gb          98  15.4gb met-cla-deves02                20.5kb                   0            850          21.2mb                           0b                          0b                      956.5kb
7.3.0   12.0.1        643mb            5   11.9gb      15.3gb          99  15.4gb met-cla-deves03                37.5kb                   0            825          47.6mb                          7mb                          0b                        1.9mb
7.3.0   12.0.1      570.7mb            4   11.9gb      15.2gb          99  15.4gb atl-cla-deves03                18.7kb                   0            636          38.5mb                           0b                          0b                          1mb
7.3.0   12.0.1      978.3mb            7   11.9gb      15.3gb          99  15.4gb atl-cla-deves01                28.6kb                   0            738          31.3mb                       23.4mb                       123kb                        1.6mb

epoch      timestamp cluster          status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1587734763 13:26:03  cla-dev-elastic7 green           6         6    490 245    0    0        0             0                  -                100.0%

We are not on 7.4; we are on 7.3.

My first comment would be that, if I'm not wrong, your hosts have 16gb of RAM and you've configured the JVM heap to 12gb.

We usually recommend setting the JVM heap to at most 50% of the system memory.

Elasticsearch doesn't only use the JVM heap; it also makes use of off-heap memory and the file system cache.
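For these 16 GB hosts, the 50% rule works out as sketched below. The compressed-oops cap is a general JVM sizing guideline (stay under roughly 31 GB so the JVM can keep using compressed object pointers), not something specific to this thread:

```python
def recommended_heap_gb(system_ram_gb: float) -> float:
    """Rule-of-thumb heap size: at most 50% of system RAM,
    capped below the compressed-oops threshold."""
    COMPRESSED_OOPS_LIMIT_GB = 31  # stay under this to keep compressed oops
    return min(system_ram_gb / 2, COMPRESSED_OOPS_LIMIT_GB)

print(recommended_heap_gb(16))   # 8.0 -> -Xms8g / -Xmx8g
print(recommended_heap_gb(128))  # 31
```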

Please change the JVM heap size in jvm.options to:

-Xms8g
-Xmx8g
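After editing, a quick sanity check that Xms and Xmx really match can be scripted. This is a hypothetical helper, not an official tool:

```python
import re

def heap_settings(jvm_options_text: str) -> dict:
    """Extract the effective -Xms/-Xmx values from jvm.options text
    (if a flag appears more than once, the last occurrence wins)."""
    settings = {}
    for line in jvm_options_text.splitlines():
        m = re.match(r"\s*-X(ms|mx)(\S+)", line)
        if m:
            settings["X" + m.group(1)] = m.group(2)
    return settings

opts = """\
-Xms8g
-Xmx8g
-Xss1m
"""
s = heap_settings(opts)
print(s["Xms"] == s["Xmx"])  # True -> min and max heap match
```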

Could you also grab:

  • GET _cluster/settings?include_defaults&pretty - I would like to check the circuit breaker settings.
  • GET _cat/shards?v&bytes=b&s=index - I would like to check the size of the shards.
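Once you have the `_cat/shards?bytes=b` output, oversized shards are easy to flag programmatically. A sketch, assuming the default `_cat/shards` column order (index, shard, prirep, state, docs, store, ip, node); the sample lines and the 50 GB threshold are illustrative, the threshold being a common shard-size guideline rather than a hard limit:

```python
FIFTY_GB = 50 * 1024**3  # common guideline for maximum shard size

def oversized_shards(cat_shards_output: str, limit_bytes: int = FIFTY_GB):
    """Return (index, shard, store_bytes) for shards over limit_bytes.
    Lines without a numeric store column (e.g. UNASSIGNED) are skipped."""
    flagged = []
    for line in cat_shards_output.strip().splitlines():
        cols = line.split()
        if len(cols) >= 6 and cols[5].isdigit():
            index, shard, store = cols[0], cols[1], int(cols[5])
            if store > limit_bytes:
                flagged.append((index, shard, store))
    return flagged

# Hypothetical sample: one 60 GB shard and one tiny .kibana shard.
sample = """\
filebeat-7.3.2-2020.04.22-000008 0 p STARTED 1000 64424509440 10.188.0.223 met-cla-deves03
.kibana 0 p STARTED 42 1048576 10.188.0.221 atl-cla-deves01
"""
print(oversized_shards(sample))  # only the 60 GB filebeat shard is flagged
```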
  "message": "Client request error: connect ECONNREFUSED 10.88.0.221:9500",
  "statusCode": 502,
  "error": "Bad Gateway"
}

I've taken a look at the jvm.options file and it seems you are using an old version of it.
I see 2 major problems:

-XX:+DisableExplicitGC
-server

Can you please replace the content of your jvm.options file with the one that ships with Elasticsearch 7.3.0 (apart from the Xms/Xmx settings)?

It should look something like this:

-Xms8g
-Xmx8g

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10

-XX:+AlwaysPreTouch

-Xss1m

-Djava.awt.headless=true

-Dfile.encoding=UTF-8

-Djna.nosys=true

-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:+HeapDumpOnOutOfMemoryError

-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT

We originally had it at 8 and it was crashing, but we moved it to 12 and it has been stable since last night.

Even if you want to keep 12g, I strongly suggest aligning the jvm.options file with the one that ships with Elasticsearch 7.3.

The heap sizing recommendations are available here.