Elasticsearch keeps crashing: shards failing and "Data too large"

When I restart Elasticsearch, the cluster goes green for a little while, then it goes back to red and Kibana will no longer come up. Instead I see this error: {"message":"all shards failed: [search_phase_execution_exception] all shards failed","statusCode":503,"error":"Service Unavailable"}. I'm not sure why this is happening; it keeps crashing over and over, and restarts don't last long.

I see "Data too large" errors and shards failing in the logs, but how do I fix this issue?

[2020-04-22T15:53:54,901][DEBUG][o.e.a.s.TransportSearchAction] [atl-cla-deves01] All shards failed for phase: [query]
[2020-04-22T15:53:54,902][WARN ][r.suppressed             ] [atl-cla-deves01] path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:305) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:264) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$1(InitialSearchPhase.java:251) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:172) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-04-22T15:53:56,175][WARN ][r.suppressed             ] [atl-cla-deves01] path: /.kibana/_doc/space%3Adefault, params: {index=.kibana, id=space:default}
org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana][_doc][space:default]: routing [null]]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:228) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:205) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:103) [elasticsearch-7.3.0.jar:7.3.0]
[2020-04-22T15:54:17,574][WARN ][o.e.c.r.a.AllocationService] [atl-cla-deves01] [.monitoring-kibana-7-2020.04.22][0] marking unavailable shards as stale: [VOA34f_1Tr6A4evPYstGFQ]
[2020-04-22T15:54:17,574][WARN ][o.e.c.r.a.AllocationService] [atl-cla-deves01] [.monitoring-logstash-7-2020.04.22][0] marking unavailable shards as stale: [vmkB0OVWTtCM29_5fNeCrg]
[2020-04-22T15:54:22,056][INFO ][o.e.c.r.a.AllocationService] [atl-cla-deves01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-7-2020.04.22][0]] ...]).
[2020-04-22T15:54:22,539][WARN ][o.e.x.m.e.l.LocalExporter] [atl-cla-deves01] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: RemoteTransportException[[met-cla-deves03][10.188.0.223:9300][indices:data/write/bulk[s]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [8376777120/7.8gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8376769800/7.8gb], new bytes reserved: [7320/7.1kb], usages [request=0/0b, fielddata=7020/6.8kb, in_flight_requests=7320/7.1kb, accounting=46573684/44.4mb]];
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:473) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:468) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:822) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:780) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.0.jar:7.3.0]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]

Caused by: org.elasticsearch.transport.RemoteTransportException: [met-cla-deves03][10.188.0.223:9300][indices:data/write/bulk[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [8376777120/7.8gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8376769800/7.8gb], new bytes reserved: [7320/7.1kb], usages [request=0/0b, fielddata=7020/6.8kb, in_flight_requests=7320/7.1kb, accounting=46573684/44.4mb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-04-22T17:23:40,549][INFO ][o.e.c.m.MetaDataCreateIndexService] [atl-cla-deves01] [filebeat-7.3.2-2020.04.22-000008] creating index, cause [rollover_index], templates [filebeat-7.3.2], shards [1]/[1], mappings [_doc]
[2020-04-22T17:23:40,928][INFO ][o.e.c.r.a.AllocationService] [atl-cla-deves01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[filebeat-7.3.2-2020.04.22-000008][0]] ...]).
[2020-04-22T17:50:34,754][WARN ][o.e.m.j.JvmGcMonitorService] [atl-cla-deves01] [gc][7012] overhead, spent [682ms] collecting in the last [1s]
[2020-04-22T17:51:44,119][INFO ][o.e.x.m.p.NativeController] [atl-cla-deves01] Native controller process has stopped - no new native processes can be started

How many nodes, shards, and indices do you have?
What heap size is configured on your nodes?

We have just recently added 3 more nodes to make it six, hoping that would help.

In the logs you've shared, the circuit breaker is tripping and blocking a few operations:

CircuitBreakingException[[parent] Data too large,
data for [<transport_request>] would be [8376777120/7.8gb],
which is larger than the limit of [8127315968/7.5gb],
real usage: [8376769800/7.8gb],
new bytes reserved: [7320/7.1kb],
usages [request=0/0b,
fielddata=7020/6.8kb,
in_flight_requests=7320/7.1kb,
accounting=46573684/44.4mb]];
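
As a quick check, the per-node breaker state can be pulled from the node stats API; a minimal curl sketch (adjust the host to one of your nodes):

# Show current circuit-breaker limits and estimates per node; the "parent"
# breaker is the one tripping above (limit ~7.5gb, estimated ~7.8gb).
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'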

The following log line highlights that JVM heap usage is high and that garbage collection is being triggered:

[2020-04-22T17:50:34,754][WARN ][o.e.m.j.JvmGcMonitorService] [atl-cla-deves01] [gc][7012] overhead, spent [682ms] collecting in the last [1s]

Would you please share the following (if Kibana is unavailable, see the curl sketch after this list):

  • Which garbage collector you're using
  • The content of your jvm.options file
  • The output of GET _cat/nodes?v&h=version,jdk,heap.*,ram.*,name,fielddata.*,segments.*
  • The output of GET _cat/health?v
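
These can be run with curl against any node, for example (host and port are placeholders for your environment):

# Same requests as above, issued via curl instead of Kibana Dev Tools;
# replace localhost with the address of any node in the cluster.
curl -s 'http://localhost:9200/_cat/nodes?v&h=version,jdk,heap.*,ram.*,name,fielddata.*,segments.*'
curl -s 'http://localhost:9200/_cat/health?v'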

If you're using G1GC, it might be a known problem which was identified and fixed in 7.4.1+ (see PR).
The fix, if you're using G1GC, is also applicable to 7.4.0.
You have to edit the jvm.options file to match the following settings; roughly, they make G1 keep more heap in reserve and start collecting earlier, so real memory usage stays below the parent circuit breaker's limit:

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30
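
If in doubt about which collector a node is actually running, the nodes info API reports it; a hedged sketch (filter_path just trims the response):

# Lists the active garbage collectors per node, typically
# ["ParNew", "ConcurrentMarkSweep"] for CMS or
# ["G1 Young Generation", "G1 Old Generation"] for G1GC.
curl -s 'http://localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.gc_collectors&pretty'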

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms12g
-Xmx12g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## optimizations

# disable calls to System#gc
-XX:+DisableExplicitGC

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# force the server VM (remove on 32-bit client JVMs)
-server

# explicitly set the stack size (reduce to 320k on 32-bit client JVMs)
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# use old-style file permissions on JDK9
-Djdk.io.permissionsUseCanonicalPath=true

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true

-Dindex.shard.check_on_startup=true

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
#-XX:HeapDumpPath=${heap.dump.path}

## GC logging

#-XX:+PrintGCDetails
#-XX:+PrintGCTimeStamps
#-XX:+PrintGCDateStamps
#-XX:+PrintClassHistogram
#-XX:+PrintTenuringDistribution
#-XX:+PrintGCApplicationStoppedTime

# log GC status to a file with time stamps
# ensure the directory exists
#-Xloggc:${loggc}

# Elasticsearch 5.0.0 will throw an exception on unquoted field names in JSON.
# If documents were already indexed with unquoted fields in a previous version
# of Elasticsearch, some operations may throw errors.
#
# WARNING: This option will be removed in Elasticsearch 6.0.0 and is provided
# only for migration purposes.
#-Delasticsearch.json.allow_unquoted_field_names=true


Please format the file using Markdown (use ``` to wrap the text).
It seems you're not using G1GC.

The commands I've shared should be executed in Kibana Dev Tools or via curl.

(Columns: version, jdk, heap.current, heap.percent, heap.max, ram.current, ram.percent, ram.max, name, fielddata.memory_size, fielddata.evictions, segments.count, segments.memory, segments.index_writer_memory, segments.version_map_memory, segments.fixed_bitset_memory.)

7.3.0   12.0.1      552.7mb            4   11.9gb      15.3gb          99  15.4gb atl-cla-deves02                13.3kb                   0            739          39.1mb                        3.9mb                          0b                      659.4kb
7.3.0   12.0.1      400.7mb            3   11.9gb      15.3gb          99  15.4gb met-cla-deves01                30.5kb                   0            775            34mb                        1.9mb                          0b                        1.3mb
7.3.0   12.0.1      537.5mb            4   11.9gb      15.2gb          98  15.4gb met-cla-deves02                20.5kb                   0            850          21.2mb                           0b                          0b                      956.5kb
7.3.0   12.0.1        643mb            5   11.9gb      15.3gb          99  15.4gb met-cla-deves03                37.5kb                   0            825          47.6mb                          7mb                          0b                        1.9mb
7.3.0   12.0.1      570.7mb            4   11.9gb      15.2gb          99  15.4gb atl-cla-deves03                18.7kb                   0            636          38.5mb                           0b                          0b                          1mb
7.3.0   12.0.1      978.3mb            7   11.9gb      15.3gb          99  15.4gb atl-cla-deves01                28.6kb                   0            738          31.3mb                       23.4mb                       123kb                        1.6mb

(Columns: epoch, timestamp, cluster, status, node.total, node.data, shards, pri, relo, init, unassign, pending_tasks, max_task_wait_time, active_shards_percent.)

1587734763 13:26:03  cla-dev-elastic7 green           6         6    490 245    0    0        0             0                  -                100.0%

We are not on 7.4; we are on 7.3.

My first comment would be that, if I'm not wrong, your hosts have 16 GB of RAM and you've configured the JVM heap to 12 GB.

We usually recommend setting the JVM heap to at most 50% of the system memory.

Elasticsearch doesn't only use the JVM heap; it also makes heavy use of off-heap memory and the filesystem cache. For example, on a 16 GB host, an 8 GB heap leaves the remaining 8 GB for off-heap allocations and the filesystem cache.

Please change the JVM heap size in jvm.options to:

-Xms8g
-Xmx8g
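
After restarting the nodes, the new heap can be verified against total RAM; a quick sketch, assuming the default HTTP port:

# heap.max should now read ~8gb against a ram.max of ~15.4gb, i.e. roughly 50%.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,ram.max'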

Could you also grab:

  • GET _cluster/settings?include_defaults&pretty - I would like to check the circuit breaker settings.
  • GET _cat/shards?v&bytes=b&s=index - I would like to check the size of the shards.
  "message": "Client request error: connect ECONNREFUSED 10.88.0.221:9500",
  "statusCode": 502,
  "error": "Bad Gateway"
}
{
  "message": "Client request error: connect ECONNREFUSED 10.88.0.221:9500",
  "statusCode": 502,
  "error": "Bad Gateway"
}

I've taken a look at the jvm.options file and it seems you are using an old version of it.
I see 2 major problems:

-XX:+DisableExplicitGC
-server

Neither of these appears in the jvm.options that ships with 7.3.0. Can you please replace the content of your jvm.options file with the one that comes with Elasticsearch 7.3.0 (keeping your own Xms/Xmx settings)?

It should be something like this:

-Xms8g
-Xmx8g

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10

-XX:+AlwaysPreTouch

-Xss1m

-Djava.awt.headless=true

-Dfile.encoding=UTF-8

-Djna.nosys=true

-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

-XX:+HeapDumpOnOutOfMemoryError

-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT
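
Once the file is replaced and the nodes restarted, the flags each JVM actually picked up can be confirmed via the nodes info API; a minimal sketch:

# input_arguments lists every -X/-XX/-D flag a node was started with, so you
# can confirm the old -server / -XX:+DisableExplicitGC flags are gone.
curl -s 'http://localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.input_arguments&pretty'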

We originally had it at 8 GB and it was crashing, but we moved it to 12 GB and it has been stable since last night.

Even if you want to keep 12g, I strongly suggest aligning the jvm.options file with the one that comes with Elasticsearch 7.3.

The heap sizing recommendations are available here.
