Data node fails while creating replicas

Hi,

Cluster info
5 master nodes - 16 GB RAM, 4 CPU
4 data nodes - 30.5 GB RAM, 4 CPU
1 Coordinator node - 16 GB RAM, 8 CPU
50% of RAM allocated to heap on all nodes

Use case
Restored a snapshot from an S3 repo. The index is 120 GB with ~66 million docs.
The index was restored with 3 shards and 0 replicas. The restore itself goes well, though heap usage on the data nodes sits around 60%.

Once I update the settings to add 1 replica, one of the data nodes crashes when replica creation is roughly 40% complete.
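For reference, this is roughly the settings change involved, plus the capacity arithmetic implied by the figures above (a minimal sketch using only numbers from this post; the index name is omitted and nothing here is a diagnosis):

```python
import json

# Body of the PUT /<index>/_settings request that triggers replica creation.
settings_body = {"index": {"number_of_replicas": 1}}
print(json.dumps(settings_body))

# Rough arithmetic from the cluster/index figures above.
index_gb = 120            # restored index size
primaries = 3             # shards at restore time
replicas = 1              # replicas added afterwards
data_nodes = 4
heap_gb = 30.5 * 0.5      # 50% of RAM allocated to heap per data node

per_shard_gb = index_gb / primaries        # size of each shard copy
total_copies = primaries * (1 + replicas)  # shard copies to place on 4 nodes
print(per_shard_gb, total_copies, heap_gb)
```

So each data node holds one or two ~40 GB shard copies against ~15 GB of heap, which leaves little headroom for anything heap-resident per shard.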
Logs:

[2019-06-24T08:50:18,280][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][15992] overhead, spent [6.2s] collecting in the last [6.8s]
[2019-06-24T08:50:18,270][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.fst.FST.<init>(FST.java:342) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.util.fst.FST.<init>(FST.java:274) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
        at org.apache.lucene.search.suggest.document.NRTSuggester.load(NRTSuggester.java:306) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionsTermsReader.suggester(CompletionsTermsReader.java:66) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.apache.lucene.search.suggest.document.CompletionTerms.suggester(CompletionTerms.java:71) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
        at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:219) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1027) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:121) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1288) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
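The trace above dies in Engine.completionStats while loading a completion-suggester FST into heap. If it helps with diagnosis, the index-stats API's completion metric reports how much heap those suggester structures occupy; a minimal sketch (the localhost address is an assumption for illustration) just builds the request URL:

```python
from urllib.parse import urljoin

# Hypothetical cluster address; adjust for your deployment.
base = "http://localhost:9200/"

# The completion metric of the index-stats API reports the in-heap size
# of completion-suggester FSTs (completion.size_in_bytes).
url = urljoin(base, "_stats/completion")
print(url)
```

Querying that URL (e.g. with curl) would show how much heap the suggester FSTs consume per index.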

I also observe the following at other times:

[2019-06-24T04:55:51,694][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][1957] overhead, spent [824ms] collecting in the last [1.4s]
[2019-06-24T04:55:52,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:12,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:32,825][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:52,878][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
  • This happens every time I try the above
  • ES version 6.6.1
  • This is the only index in the cluster
  • I am using Cerebro and X-Pack
  • Elasticsearch is running on spot instances (though no spot instances were replaced during the issue)

Help?

Thanks,
Barak

Also, after starting ES again on the failed node, I get this within 2-3 minutes:

[2019-06-24T09:12:29,602][INFO ][o.e.n.Node               ] [ip-172-30-1-224.ec2.internal] started
[2019-06-24T09:14:08,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:28,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:14:48,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:08,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:28,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:15:48,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T09:16:13,440][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][219] overhead, spent [11.1s] collecting in the last [11.1s]
[2019-06-24T09:16:18,684][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][220] overhead, spent [5.2s] collecting in the last [5.2s]
[2019-06-24T09:16:21,829][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][221] overhead, spent [3.1s] collecting in the last [3.1s]
[2019-06-24T09:16:21,829][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
[2019-06-24T09:16:24,390][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][222] overhead, spent [2.5s] collecting in the last [2.5s]
[2019-06-24T09:16:24,397][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#8]], exiting
java.lang.OutOfMemoryError: Java heap space

What does your data look like? What data modelling features are you using?

What type of hardware is this cluster deployed on?