Hi,
Cluster info
5 master nodes - 16 GB RAM, 4 CPU
4 data nodes - 30.5 GB RAM, 4 CPU
1 Coordinator node - 16 GB RAM, 8 CPU
50% of RAM allocated to heap on all nodes
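For reference, the heap is set in jvm.options; on the data nodes it comes out to roughly the following (an illustration based on the sizes above, not a copy of the actual file):

# jvm.options on the data nodes: heap is ~50% of 30.5 GB RAM
-Xms15g
-Xmx15g
# master and coordinator nodes use 8g instead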
Use case
Restored a snapshot from an S3 repo. The index is 120 GB with ~66 million docs.
The index is restored with 3 shards and 0 replicas. The restore itself goes well, though heap usage on the data nodes sits at around 60%.
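The restore call itself was along these lines (repo, snapshot, and index names are placeholders for the real ones):

POST _snapshot/my_s3_repo/my_snapshot/_restore
{
  "indices": "my_index",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}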
Once I update the index settings to add 1 replica, a node crashes roughly 40% of the way into replica creation.
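The settings update that triggers the crash is just the standard replica change, roughly (again, the index name is a placeholder):

PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}

Replica creation progress (the ~40% above) can be tracked with the recovery API, e.g. GET _cat/recovery?v.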
Logs:
[2019-06-24T08:50:18,280][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][15992] overhead, spent [6.2s] collecting in the last [6.8s]
[2019-06-24T08:50:18,270][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-30-1-224.ec2.internal] fatal error in thread [elasticsearch[ip-172-30-1-224.ec2.internal][management][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.fst.FST.<init>(FST.java:342) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.apache.lucene.util.fst.FST.<init>(FST.java:274) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.apache.lucene.search.suggest.document.NRTSuggester.load(NRTSuggester.java:306) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
    at org.apache.lucene.search.suggest.document.CompletionsTermsReader.suggester(CompletionsTermsReader.java:66) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
    at org.apache.lucene.search.suggest.document.CompletionTerms.suggester(CompletionTerms.java:71) ~[lucene-suggest-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:57]
    at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:219) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1027) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:121) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:138) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:259) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1288) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
At other times I also observe this:
[2019-06-24T04:55:51,694][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-30-1-224.ec2.internal] [gc][1957] overhead, spent [824ms] collecting in the last [1.4s]
[2019-06-24T04:55:52,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:12,600][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:32,825][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
[2019-06-24T04:56:52,878][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-1-224.ec2.internal] collector [node_stats] timed out when collecting data
- This happens every time I try the above
- ES version 6.6.1
- This is the only index in the cluster
- I am using Cerebro and X-Pack
- Elasticsearch is running on spot instances (though no spot instances were replaced during the issue)
Help?
Thanks,
Barak