Data node fails while creating replicas

Hi,

Cluster info
5 master nodes - 16 GB RAM, 4 CPU
4 data nodes - 30.5 GB RAM, 4 CPU
1 Coordinator node - 16 GB RAM, 8 CPU
50% of RAM allocated to heap on all nodes
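
For completeness, heap is set the usual way in jvm.options; on the data nodes that works out to roughly 15 GB. A sketch of the config (paths and exact values may differ per install):

# /etc/elasticsearch/jvm.options (data nodes: 30.5 GB RAM -> ~15 GB heap)
-Xms15g
-Xmx15g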

Use case
Restored a snapshot from an S3 repository. The index is 120 GB with ~66 million docs.
The index was restored with 3 shards and 0 replicas. The restore itself goes fine, though heap usage on the data nodes sits around 60%.
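The restore was along these lines (repository, snapshot, and index names below are placeholders, not the real ones):

POST _snapshot/my_s3_repo/snapshot_1/_restore
{
  "indices": "my_index",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}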

Once I update the index settings to add 1 replica, a data node crashes roughly 40% of the way into the replica creation.
Logs:

[2019-06-24T08:50:18,280][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][15992] overhead, spent [6.2s] collecting in the last [6.8s]
[2019-06-24T08:50:18,270][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.fst.FST.<init>(FST.java:342) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.util.fst.FST.<init>(FST.java:274) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.search.suggest.document.NRTSuggester.load(NRTSuggester.java:306) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionsTermsReader.suggester(CompletionsTermsReader.java:66) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionTerms.suggester(CompletionTerms.java:71) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:219) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1027) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:121) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1288) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
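
If I read the trace right, the OOM happens while a stats request forces the completion suggester FSTs onto the heap (Engine.completionStats via TransportClusterStatsAction). The in-heap size of any completion fields can be checked with the stats APIs, e.g. (index name is a placeholder):

GET my_index/_stats/completion
GET _nodes/stats/indices/completion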

I also observe this at other times:

[2019-06-24T04:55:51,694][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][1957] overhead, spent [824ms] collecting in the last [1.4s]
[2019-06-24T04:55:52,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:12,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:32,825][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:52,878][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
  • This happens every time I try the above
  • ES version 6.6.1
  • This is the only index in the cluster
  • I am using Cerebro and X-Pack
  • Elasticsearch runs on spot instances (though no spot instances were replaced while this was happening)
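
While reproducing this, heap usage and recovery progress can be tracked with the cat APIs, e.g.:

GET _cat/nodes?v&h=name,heap.percent,heap.max
GET _cat/recovery?v&active_only=true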

Help?

Thanks,
Barak

Also, after starting Elasticsearch again on the failed node, I get this within 2-3 minutes:

[2019-06-24T09:12:29,602][INFO ][o.e.n.Node               ] [ip-172-30-1-224.ec2.internal] started
[2019-06-24T09:14:08,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:28,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:48,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:08,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:28,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:48,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][219] overhead, spent [11.1s] collecting in the last [11.1s]
[2019-06-24T09:16:18,684][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][220] overhead, spent [5.2s] collecting in the last [5.2s]
[2019-06-24T09:16:21,829][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][221] overhead, spent [3.1s] collecting in the last [3.1s]
[2019-06-24T09:16:21,829][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
[2019-06-24T09:16:24,390][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][222] overhead, spent [2.5s] collecting in the last [2.5s]
[2019-06-24T09:16:24,397][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#8]], exiting
java.lang.OutOfMemoryError: Java heap space
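
One way to keep the node up long enough to inspect it (untested on my side, just the standard cluster setting) would be to pause shard allocation before restarting it and re-enable allocation afterwards:

PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}

# restart the node, inspect it, then:

PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}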

What does your data look like? What data modelling features are you using?

What type of hardware is this cluster deployed on?
