Hello,
I hope you and your loved ones are safe and healthy.
I have a 3-node cluster with 2 data nodes and 1 voting-only node.
I accidentally deleted the separate disk that holds the data for one of the data nodes. I keep the Elasticsearch data folder and the OS/installation folder on different disks.
Prior to the accident the cluster was in green state.
I have attached a new disk to the VM, but the synchronization fails at 96% and the Elasticsearch service on the secondary node (the one that has all of the data) crashes.
This cluster holds two years' worth of research data collected as part of my reading for an MSc degree, and I am hoping not to lose it.
I am not panicking, since one of the nodes has all of the data.
However, the cluster is not synchronizing and is therefore inoperable for collecting additional data.
The largest shard is 100 GB, and most of the research data is in shards of ~70 GB. There are 10 such shards.
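For reference, shard sizes like these can be confirmed with the standard _cat/shards API; something along these lines (host/port are assumed, adjust to your node's HTTP address):

```shell
# List shards with their on-disk size, largest first, to identify the
# big research shards. Host/port below are placeholders for one of the
# data nodes' HTTP endpoints.
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store&s=store:desc' | head -n 15
```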
I have:
- Stopped all ingestion to the cluster.
- Stopped all other services (Kibana, Heartbeat, Metricbeat) on the nodes.
- All logs are ingested via a separate node running Logstash, which is now shut down.
- Rebooted both of the data nodes.
- Shut down the voting-only node, which led to this error when querying the primary node using Postman:
"type": "master_not_discovered_exception",
- Restarted the voting-only node.
- Increased the RAM on the VMs to 48 GB per node, with 24 GB given to Elasticsearch via jvm.options.
- Set sysctl -w vm.max_map_count=262144
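Note that the sysctl change above is not persistent across reboots; since the data nodes have been rebooted, it can be made permanent via a drop-in config file. A minimal sketch (the file name under /etc/sysctl.d/ is my own choice):

```shell
# Persist vm.max_map_count so it survives the reboots mentioned above.
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-elasticsearch.conf

# Reload all sysctl configuration files.
sudo sysctl --system

# Verify the running value.
sysctl vm.max_map_count
```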
The last log entries on the secondary node (the one that is holding all of the data) are:
[2021-05-29T18:47:07,958][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [secondarynode] Active license is now [TRIAL]; Security is enabled
[2021-05-29T18:47:07,964][INFO ][o.e.h.AbstractHttpServerTransport] [secondarynode] publish_address {192.168.0.236:9200}, bound_addresses {192.168.0.236:9200}
[2021-05-29T18:47:07,964][INFO ][o.e.n.Node ] [secondarynode] started
[2021-05-29T18:47:51,696][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [secondarynode] fatal error in thread [elasticsearch[secondarynode][generic][T#12]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler$1.tryAction(RemoteRecoveryTargetHandler.java:235) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:215) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:66) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.executeRetryableAction(RemoteRecoveryTargetHandler.java:245) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.writeFileChunk(RemoteRecoveryTargetHandler.java:205) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.RecoverySourceHandler$2.executeChunkRequest(RecoverySourceHandler.java:950) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.RecoverySourceHandler$2.executeChunkRequest(RecoverySourceHandler.java:903) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:112) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.access$000(MultiChunkTransfer.java:48) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:67) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:76) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:78) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:113) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:387) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:387) ~[elasticsearch-7.13.0.jar:7.13.0]
As noted, the service on the node that holds the data keeps crashing during the sync. Below are the last few lines of the log on the primary node (the one where I deleted the data disk and which is receiving the data):
[2021-05-29T19:19:56,469][INFO ][o.e.i.s.IndexShard ] [primarynode] [filebeat-7.12.1-2021.05.18][0] primary-replica resync completed with 0 operations
[2021-05-29T19:19:56,470][INFO ][o.e.i.s.IndexShard ] [primarynode] [.ml-config][0] primary-replica resync completed with 0 operations
[2021-05-29T19:19:56,485][INFO ][o.e.c.r.DelayedAllocationService] [primarynode] scheduling reroute for delayed shards in [58.7s] (629 delayed shards)
[2021-05-29T19:19:56,934][WARN ][o.e.a.b.TransportShardBulkAction] [primarynode] [[packetbeat-7.13.0-2021.05.28-000001][0]] failed to perform indices:data/write/bulk[s] on replica [packetbeat-7.13.0-2021.05.28-000001][0], node[DCEyrCsJSw6xVPw0xFpO5Q], [R], s[STARTED], a[id=Yup1YJgXQbKWWPl_fdxS9g]
org.elasticsearch.client.transport.NoNodeAvailableException: unknown node [DCEyrCsJSw6xVPw0xFpO5Q]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1070) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [secondarynode][192.168.0.236:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:178) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:780) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1077) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [secondarynode][192.168.0.236:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:178) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:780) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1077) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-05-29T19:19:56,935][WARN ][o.e.c.r.a.AllocationService] [primarynode] [packetbeat-7.13.0-2021.05.28-000001][0] marking unavailable shards as stale: [Yup1YJgXQbKWWPl_fdxS9g]
[2021-05-29T19:19:59,647][WARN ][o.e.c.r.a.AllocationService] [primarynode] [winlogbeat-7.13.0-2021.05.28-000001][0] marking unavailable shards as stale: [DlbwBVX2RgO5Ny1XOOpYvQ]
What should I do to make the cluster operational again?
- Should I delete the data on the primarynode and attempt a resync?
- A few indices hold the research data (all with names starting with Cowrie-); the others I can recreate. How do I save my research data?
Please see my comments below for the diagnostics I've attempted.
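In case it is relevant: before trying anything destructive, I am considering snapshotting just the research indices to a shared-filesystem repository. A sketch of what I have in mind (the repository name, path, and index pattern are placeholders, and the path would first need to be listed under path.repo in elasticsearch.yml on every node):

```shell
# Register a shared-filesystem snapshot repository. The location is a
# placeholder; it must appear under path.repo in elasticsearch.yml.
curl -s -X PUT 'http://localhost:9200/_snapshot/research_backup' \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/backup/es_snapshots"}}'

# Snapshot only the research indices (index pattern assumed from their names).
curl -s -X PUT 'http://localhost:9200/_snapshot/research_backup/cowrie-pre-repair?wait_for_completion=false' \
  -H 'Content-Type: application/json' \
  -d '{"indices": "cowrie-*", "ignore_unavailable": true}'
```

If the snapshot completes, I would at least have an independent copy of the critical indices before experimenting with the failing recovery.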