Node sync fails and cluster goes to "red"

Hello,

I hope you and your loved ones are safe and healthy.

I have a 3-node cluster with 2 data nodes and 1 voting-only node.

I accidentally deleted the separate disk that holds the data for one of the data nodes. I keep the Elasticsearch data folder and the OS/installation folder on different disks.

Prior to the accident the cluster was in a green state.

I have attached a new disk to the VM, but the synchronization fails at 96% and the Elasticsearch service on the secondary node (the one that has all of the data) crashes.

This cluster holds two years' worth of research data gathered for my MSc degree, and I am hoping not to lose it.

I am not panicking, since one of the nodes has all of the data.

However, the cluster is not synchronising and is therefore unable to collect additional data.

The largest shard is 100 GB, and most of the research data is in shards of ~70 GB. There are 10 such shards.

I have:

  1. Stopped all ingestion to the cluster.
  2. Stopped all other services (Kibana, Heartbeat, Metricbeat) on the nodes.
  3. All logs are ingested via a separate node running Logstash, which is shut down.
  4. Rebooted both of the data nodes.
  5. Shut down the voting-only node, which led to this error when querying the primary node using Postman:
"type": "master_not_discovered_exception",
  6. Restarted the voting-only node.
  7. Increased RAM on the VMs to 48 GB per node, with 24 GB given to Elasticsearch through jvm.options (the heap lines are sketched just after this list).
  8. Set sysctl -w vm.max_map_count=262144
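
For reference, the heap change in item 7 is just the two lines below in jvm.options (24 GB in my case; the exact value and file location depend on your install):

# jvm.options heap settings on the data nodes (adjust to your own sizing)
-Xms24g
-Xmx24g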

The last log entries on the secondary node (the one holding all of the data) are:

[2021-05-29T18:47:07,958][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [secondarynode] Active license is now [TRIAL]; Security is enabled
[2021-05-29T18:47:07,964][INFO ][o.e.h.AbstractHttpServerTransport] [secondarynode] publish_address {192.168.0.236:9200}, bound_addresses {192.168.0.236:9200}
[2021-05-29T18:47:07,964][INFO ][o.e.n.Node               ] [secondarynode] started
[2021-05-29T18:47:51,696][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [secondarynode] fatal error in thread [elasticsearch[secondarynode][generic][T#12]], exiting
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler$1.tryAction(RemoteRecoveryTargetHandler.java:235) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:215) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:66) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.executeRetryableAction(RemoteRecoveryTargetHandler.java:245) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.writeFileChunk(RemoteRecoveryTargetHandler.java:205) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$2.executeChunkRequest(RecoverySourceHandler.java:950) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$2.executeChunkRequest(RecoverySourceHandler.java:903) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:112) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.access$000(MultiChunkTransfer.java:48) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:67) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:76) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:78) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:113) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:387) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) ~[elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:387) ~[elasticsearch-7.13.0.jar:7.13.0]
        at 

However, the service on the node that holds the data keeps crashing during the sync. Here is the log as I see it on the primary node (the one where I deleted the data disk and which is receiving the data).

Here are the last few lines on the primarynode:

[2021-05-29T19:19:56,469][INFO ][o.e.i.s.IndexShard       ] [primarynode] [filebeat-7.12.1-2021.05.18][0] primary-replica resync completed with 0 operations
[2021-05-29T19:19:56,470][INFO ][o.e.i.s.IndexShard       ] [primarynode] [.ml-config][0] primary-replica resync completed with 0 operations
[2021-05-29T19:19:56,485][INFO ][o.e.c.r.DelayedAllocationService] [primarynode] scheduling reroute for delayed shards in [58.7s] (629 delayed shards)
[2021-05-29T19:19:56,934][WARN ][o.e.a.b.TransportShardBulkAction] [primarynode] [[packetbeat-7.13.0-2021.05.28-000001][0]] failed to perform indices:data/write/bulk[s] on replica [packetbeat-7.13.0-2021.05.28-000001][0], node[DCEyrCsJSw6xVPw0xFpO5Q], [R], s[STARTED], a[id=Yup1YJgXQbKWWPl_fdxS9g]
org.elasticsearch.client.transport.NoNodeAvailableException: unknown node [DCEyrCsJSw6xVPw0xFpO5Q]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1070) [elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
        Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [secondarynode][192.168.0.236:9300] Node not connected
                at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:178) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:780) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1077) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
                at java.lang.Thread.run(Thread.java:831) [?:?]
        Suppressed: org.elasticsearch.transport.NodeNotConnectedException: [secondarynode][192.168.0.236:9300] Node not connected
                at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:178) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:780) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:679) ~[elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1077) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:233) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:88) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
                at java.lang.Thread.run(Thread.java:831) [?:?]
[2021-05-29T19:19:56,935][WARN ][o.e.c.r.a.AllocationService] [primarynode] [packetbeat-7.13.0-2021.05.28-000001][0] marking unavailable shards as stale: [Yup1YJgXQbKWWPl_fdxS9g]
[2021-05-29T19:19:59,647][WARN ][o.e.c.r.a.AllocationService] [primarynode] [winlogbeat-7.13.0-2021.05.28-000001][0] marking unavailable shards as stale: [DlbwBVX2RgO5Ny1XOOpYvQ]

What should I do to make the cluster operational?

  1. Should I delete the data on the primarynode and attempt a resync?
  2. There are a few indices which hold the research data (all with names starting with Cowrie-); the others I can recreate. How do I save my research data?

Please view my comments to see the diagnostics I've attempted.

Due to character limitations I am continuing here:

When I run _cat/indices/*?v&s=pri.store.size:desc on the secondary node, I see that only indices I do not need are in red (just 4 in total).

All of my research indices are in yellow.

However, this changes when I look at the same data from the primary node: my research indices are in red.

There are a total of 61 indices in red when I query the primarynode.

My guess would be a faulty disk on this node. Does dmesg report any problems?

Do you have snapshots of this data too?

Hello David,

Thank you for replying.

I do not have snapshots for the cluster. It unfortunately started with an intention to take snapshots, which failed: https://discuss.elastic.co/t/backup-snapshot-strategy-for-a-two-data-node-cluster/

Here is the dmesg log from the node where the data is available (secondary node). It in fact has a new SSD I got 2 weeks back :worried:

[ 9002.852451] sd 32:0:1:0: [sdb] tag#36 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9002.852462] sd 32:0:1:0: [sdb] tag#36 Sense Key : Hardware Error [current]
[ 9002.852475] sd 32:0:1:0: [sdb] tag#36 Add. Sense: Internal target failure
[ 9002.852478] sd 32:0:1:0: [sdb] tag#36 CDB: Read(10) 28 00 21 fa 09 00 00 01 00 00
[ 9002.852480] blk_update_request: critical target error, dev sdb, sector 570034432 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
[ 9003.028072] sd 32:0:1:0: [sdb] tag#37 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.028086] sd 32:0:1:0: [sdb] tag#37 Sense Key : Hardware Error [current]
[ 9003.028089] sd 32:0:1:0: [sdb] tag#37 Add. Sense: Internal target failure
[ 9003.028091] sd 32:0:1:0: [sdb] tag#37 CDB: Read(10) 28 00 21 fa 0a 00 00 01 00 00
[ 9003.028094] blk_update_request: critical target error, dev sdb, sector 570034688 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
[ 9003.294889] sd 32:0:1:0: [sdb] tag#42 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.294898] sd 32:0:1:0: [sdb] tag#42 Sense Key : Hardware Error [current]
[ 9003.294901] sd 32:0:1:0: [sdb] tag#42 Add. Sense: Internal target failure
[ 9003.294903] sd 32:0:1:0: [sdb] tag#42 CDB: Read(10) 28 00 21 fa 09 98 00 00 08 00
[ 9003.294906] blk_update_request: critical target error, dev sdb, sector 570034584 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 9003.467311] sd 32:0:1:0: [sdb] tag#43 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.467324] sd 32:0:1:0: [sdb] tag#43 Sense Key : Hardware Error [current]
[ 9003.467326] sd 32:0:1:0: [sdb] tag#43 Add. Sense: Internal target failure
[ 9003.467329] sd 32:0:1:0: [sdb] tag#43 CDB: Read(10) 28 00 21 fa 09 98 00 00 08 00
[ 9003.467332] blk_update_request: critical target error, dev sdb, sector 570034584 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Will I at least be able to recover the cluster with my research data?

Just to add information here: I store the OS and Elasticsearch on a single disk and all of the data on a separate physical disk.

What about using bin/elasticsearch-node repurpose or bin/elasticsearch-node detach-cluster and then reattaching it? It's a long shot, but maybe it can help.

Ok sdb is definitely broken. If you currently have no snapshots then that might mean there are no good copies of at least one shard.

There is no way to know whether this disk will fail further, so I recommend you take a snapshot of as much data as you can right now. You might need to go index-by-index since snapshotting the broken index will likely fail. Send the snapshots to an independent system, something like S3.
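
As a rough sketch, registering an S3 repository looks something like this with curl (the repository and bucket names are placeholders; this assumes the repository-s3 plugin is installed, its credentials are in the keystore, and you adjust the host, credentials and TLS options to your cluster):

# Register an S3 snapshot repository (placeholder names)
curl -u elastic -X PUT "http://localhost:9200/_snapshot/my_s3_repo" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-research-backups"
  }
}'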

This advice makes no sense: the node does not need repurposing, and the cluster is forming fine, so detaching is unnecessary. It's a straightforward read error.

Are both data nodes not also master nodes? Are you really running with a dedicated data node in a three node cluster?

Thank you for that David.

I had attached an iSCSI volume to check whether I could take a snapshot. I feel these errors would be from that time; it was attached as sdb.

The secondarynode, which has the data, runs fine without the primary node. It shows all indices in yellow except two in red which were created yesterday. I am OK with losing those indices.

Can I run an integrity validation on the indices to see which one sits on the corrupt sectors?

Also, can I back up the cluster settings and just the indices I want (Cowrie-*) to S3, as opposed to all the indices, and restore them on the primary node?

I strongly recommend taking snapshots to an independent system like S3. Taking a snapshot to another potentially broken disk won't help.

Taking a snapshot includes integrity validation. Work index-by-index until you find the one that fails.

Yes, you can (and will need to) snapshot a single index at a time.
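
For example, a per-index snapshot request could look roughly like this (repository, snapshot and index names are placeholders; adjust the host and credentials to your cluster):

# Snapshot a single index at a time into the repository registered earlier
curl -u elastic -X PUT "http://localhost:9200/_snapshot/my_s3_repo/cowrie-2021-05?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "indices": "cowrie-2021.05",
  "ignore_unavailable": true
}'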

Hello Christian:

Here are the settings from elasticsearch.yml for the secondary node (192.168.0.236) that has the data intact:

# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
#discovery.seed_hosts: ["primarynode", "secondarynode"]
#discovery.seed_hosts: ["primarynode", "secondarynode", "votingonlynode"]
discovery.seed_hosts: ["192.168.0.235", "192.168.0.236", "192.168.0.237"]
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#cluster.initial_master_nodes: ["primarynode", "secondarynode"]

#30-05-2021: Commented out on the primarynode as part of diagnosing sync failure because I deleted data disk on the primarynode. Entry noted  here for consistency. 
cluster.initial_master_nodes: ["192.168.0.235", "192.168.0.236"]



# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
node.master: true
node.voting_only: false

A community member asked me to carry out the following to try to restore, but it fails after writing 300 GB to the primarynode that I am trying to sync data to:

  1. Rename _state to _state_backup (anything but _state)
  2. Comment out: cluster.initial_master_nodes: ["192.168.0.235", "192.168.0.236"] and reboot the cluster.

Yes, there are 2 data nodes and one voting-only node in the cluster.

This did not solve the problem

They gave you some very bad advice. You should never manipulate the contents of the data path yourself.

(The advice to remove cluster.initial_master_nodes is sound, you should do that anyway, but I don't think it's relevant here)

Thanks, David. Thank you for the advice. In my case path.data and path.logs were both on the separate disk on the VM that I deleted by mistake, so I reckon they were empty and waiting to be synced.

I suspect the disk errors came when I attached a new disk, which took up /dev/sdb.

I've run badblocks to see if I can get additional data.

Should I remove cluster.initial_master_nodes from all the nodes now, or wait for full recovery before touching those settings?

Could I ask you to have a look at this: Recent unsafe memory access operation in compiled Java code - #7 by jprante. I feel it may be the size of the indices that is causing the sync issue?

I knew it was a long shot... thank you for taking the time to correct me.

Up to you, there's no rush, it's nothing to do with your current problems.

No, it's a faulty disk, it could affect an index of any size. I suppose a bad block is more likely to affect a larger index just because larger indices have more blocks.

The post you linked seems quite confused, we don't use mmap to write anything, we never require having the whole file in memory at once anyway, and even if we did it wouldn't result in errors like this when running out of space.

Thank you very much for the extended support David. I deeply appreciate it.

While I did not doubt you, the faulty SSD diagnosis seems to be correct! I recently added this disk and attached it to the VM, and at that time the data synced without any issues. I should have thought of it (just a testament that I need to take a break from studies and work).

I've run badblocks on the disk and I will pass the results to fsck to help while the replacement disk comes in.
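
For anyone following along, the check was along these lines (the device, partition and mount point here are from my setup and an assumed ext4 filesystem; the partition must be unmounted first):

# Read-only surface scan of the whole disk to see how widespread the damage is
sudo badblocks -sv /dev/sdb

# Or let e2fsck run the bad-block scan itself and record the affected blocks
sudo umount /var/lib/elasticsearch   # assumed mount point of the data disk
sudo e2fsck -fc /dev/sdb1            # assumed ext4 data partition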

I'll try to delete the larger indices which I don't need, in the hope that one of them holds the bad block(s), and then try to sync the nodes again. Hopefully it will work. My research data is almost 750 GB in size, split into 15 indices. I will try to snapshot them to S3, but I am not sure that is feasible given my home internet upload speed (capped at 5 Mbps).

Keeping my fingers crossed.

With all the resiliency I had planned (separating OS/application and data onto separate physical disks), one mistake made in fatigue (deleting the virtual disk on the primary node) has caused me real pain. Exhaustion/burnout is real. This has been a learning experience.

Hello @DavidTurner, is it safe to delete the following indices? They are large in size and I reckon they hold not raw data but analysed data which I can recreate.

green  open   .monitoring-es-7-2021.05.25                                    Z87RDUcyQZqiFOiL-huaIw   1   0   10089695      1249720      3.9gb          3.9gb
green  open   .monitoring-es-7-2021.05.24                                    8yjVzoSJQkSL1YYce_tdkA   1   0    9922505      1331520      3.8gb          3.8gb
green  open   .monitoring-es-7-mb-2021.05.28                                 WecYTFQoSgeqY3L1H1gWig   1   0    6966750            0      3.8gb          3.8gb
green  open   .monitoring-es-7-2021.05.27                                    SCWv7GjkRKi51ozluCLMpA   1   0   10503113       991852      3.7gb          3.7gb
green  open   .monitoring-es-7-2021.05.26                                    CW_pa3xLTieP8sTtQhFihA   1   0    9978777      1228898      3.6gb          3.6gb
green  open   .monitoring-es-7-mb-2021.05.29                                 -gzR63-YSTm5nAAYMnawGA   1   0    7348808            0      3.4gb          3.4gb
green  open   .monitoring-es-7-mb-2021.05.26                                 kMGnzT9hTjCuzQFKQBgwtQ   1   0    5195177            0      2.8gb          2.8gb

I've deleted most of the large indices except the ones that store the raw research data. Hopefully this will work out now :slight_smile:

If it does, I will apply for leave for an entire week :smiley: and just relax.

Their names suggest that these indices contain monitoring data about your cluster. I doubt you can re-create it but it's up to you whether you need to keep it or not.

However I strongly recommend you focus on making a copy of your important data elsewhere first. Disk failures can be progressive so the longer you delay the bigger the risk you will lose more data. It's possible that deleting data will cause other data to move around, potentially causing more of it to end up written to unreadable blocks. Moreover I can't think of a mechanism by which deleting some unimportant data will help preserve any of the important data. This path seems only to have downsides.

Hello,

I'm very happy to report that the indices holding my research data are not on bad sectors.

I have successfully been able to sync the data to the primary node (and hence onto the other SSD).

I am now going to take a snapshot to AWS (or Linode, since the AWS cost calculator comes to 50 USD :worried: ), or to my NAS over an NFS mount (the NAS has hardware-layer redundancy).

Second: I can see the cluster is still RED because a single index has neither primary nor replica shards. I am sure it is lost, but I am unable to delete it. Its size is 0 when I query Elasticsearch, and I cannot see it inside the Kibana GUI. Is there a way to resolve this?

red    open   .ds-ilm-history-5-2021.05.13-000004                            o2ew_zK5RGSgP-binaUbyw   1   1                                                

I have rebuilt my dashboards and other components, which I reckon is my only choice. Back up and test your backups: that is the lesson here.

Once the snapshots are done, I will move the storage of Elasticsearch data to another SSD.

Thank you to @DavidTurner @Christian_Dahlqvist and @csaltos for the extended help. Have a wonderful time ahead.

Hello,

I've mounted an NFS share, added it as a repository on the data nodes in the cluster, and checked its connectivity.

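Roughly, the setup was to expose the NFS mount point via path.repo on each data node (restart required) and then register a shared-filesystem repository; the mount path below is an example from my setup:

# elasticsearch.yml on each data node
path.repo: ["/mnt/nas_backup"]

# Register the NFS-backed repository
curl -u elastic -X PUT "http://localhost:9200/_snapshot/nas_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nas_backup"
  }
}'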

Backups are, however, failing unless I keep the "Ignore unavailable indices" and "Allow partial indices" options turned on. Error:

{
  "type": "snapshot_exception",
  "reason": "[nas_backup:daily-full--dpmpzql9t2g0qla24hpxtq/eNAtkTMQSVmAeyaPcVIe1A] Indices don't have primary shards [.ds-ilm-history-5-2021.05.13-000004]",
  "stack_trace": "SnapshotException[[nas_backup:daily-full--dpmpzql9t2g0qla24hpxtq/eNAtkTMQSVmAeyaPcVIe1A] Indices don't have primary shards [.ds-ilm-history-5-2021.05.13-000004]]\n\tat org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:491)\n\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:393)\n\tat org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48)\n\tat org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691)\n\tat org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:313)\n\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:208)\n\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62)\n\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140)\n\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)\n\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:831)\n"
}

I've set up an individual policy to back up just the research data. It is ongoing; I will update on its completion status soon.
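
The policy is roughly the following (the policy id, schedule and snapshot-name pattern are placeholders; cowrie-* is my research index pattern and nas_backup is the repository registered above):

# SLM policy that snapshots only the research indices to the NFS repository
curl -u elastic -X PUT "http://localhost:9200/_slm/policy/research-daily" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<research-{now/d}>",
  "repository": "nas_backup",
  "config": {
    "indices": ["cowrie-*"],
    "ignore_unavailable": true,
    "partial": false
  }
}'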