Data node fails while creating replicas

Hi,

Cluster info
5 master nodes - 16 GB RAM, 4 CPU
4 data nodes - 30.5 GB RAM, 4 CPU
1 Coordinator node - 16 GB RAM, 8 CPU
50% of RAM allocated to heap on all nodes

Use case
Restored a snapshot from an S3 repo. The index is 120 GB with ~66 million docs.
The index was restored with 3 shards and 0 replicas. The restore itself goes well, though heap usage on the data nodes sits around 60%.

Once I update the settings to add 1 replica, one of the data nodes crashes when replica creation is roughly 40% complete.
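For reference, this is roughly the settings change involved, plus the capacity arithmetic implied by the figures above (a minimal sketch using only numbers from this post; the index name is omitted and nothing here is a diagnosis):

```python
import json

# Body of the PUT /<index>/_settings request that triggers replica creation.
settings_body = {"index": {"number_of_replicas": 1}}
print(json.dumps(settings_body))

# Rough arithmetic from the cluster/index figures above.
index_gb = 120            # restored index size
primaries = 3             # shards at restore time
replicas = 1              # replicas added afterwards
data_nodes = 4
heap_gb = 30.5 * 0.5      # 50% of RAM allocated to heap per data node

per_shard_gb = index_gb / primaries        # size of each shard copy
total_copies = primaries * (1 + replicas)  # shard copies to place on 4 nodes
print(per_shard_gb, total_copies, heap_gb)
```

So each data node holds one or two ~40 GB shard copies against ~15 GB of heap, which leaves little headroom for anything heap-resident per shard.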
Logs:

[2019-06-24T08:50:18,280][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][15992] overhead, spent [6.2s] collecting in the last [6.8s]
[2019-06-24T08:50:18,270][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.fst.FST.<init>(FST.java:342) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.util.fst.FST.<init>(FST.java:274) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.search.suggest.document.NRTSuggester.load(NRTSuggester.java:306) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionsTermsReader.suggester(CompletionsTermsReader.java:66) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionTerms.suggester(CompletionTerms.java:71) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:219) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1027) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:121) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1288) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
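The trace above dies in Engine.completionStats while loading a completion-suggester FST into heap. If it helps with diagnosis, the index-stats API's completion metric reports how much heap those suggester structures occupy; a minimal sketch (the localhost address is an assumption for illustration) just builds the request URL:

```python
from urllib.parse import urljoin

# Hypothetical cluster address; adjust for your deployment.
base = "http://localhost:9200/"

# The completion metric of the index-stats API reports the in-heap size
# of completion-suggester FSTs (completion.size_in_bytes).
url = urljoin(base, "_stats/completion")
print(url)
```

Querying that URL (e.g. with curl) would show how much heap the suggester FSTs consume per index.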

I also observe the following at other times:

[2019-06-24T04:55:51,694][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][1957] overhead, spent [824ms] collecting in the last [1.4s]
[2019-06-24T04:55:52,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:12,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:32,825][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:52,878][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
  • This happens every time I try the above
  • ES version 6.6.1
  • This is the only index in the cluster
  • I am using Cerebro and X-Pack
  • Elasticsearch is running on spot instances (though no spot instances were replaced during the issue)

Help?

Thanks,
Barak

Also, after starting ES again on the failed node, I get this within 2-3 minutes:

[2019-06-24T09:12:29,602][INFO ][o.e.n.Node               ] [ip-172-30-1-224.ec2.internal] started
[2019-06-24T09:14:08,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:28,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:48,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:08,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:28,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:48,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][219] overhead, spent [11.1s] collecting in the last [11.1s]
[2019-06-24T09:16:18,684][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][220] overhead, spent [5.2s] collecting in the last [5.2s]
[2019-06-24T09:16:21,829][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][221] overhead, spent [3.1s] collecting in the last [3.1s]
[2019-06-24T09:16:21,829][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
[2019-06-24T09:16:24,390][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][222] overhead, spent [2.5s] collecting in the last [2.5s]
[2019-06-24T09:16:24,397][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#8]], exiting
java.lang.OutOfMemoryError: Java heap space

What does your data look like? What data modelling features are you using?

What type of hardware is this cluster deployed on?