Node sync fails and cluster goes to "red"

Hello David,

Thank you for replying.

I do not have snapshots for the cluster. Unfortunately, this all started with an attempt to set up snapshots, which failed: https://discuss.elastic.co/t/backup-snapshot-strategy-for-a-two-data-node-cluster/

Here is the dmesg log from the node where the data is available (secondary node). It in fact has a new SSD I got 2 weeks back :worried:

[ 9002.852451] sd 32:0:1:0: [sdb] tag#36 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9002.852462] sd 32:0:1:0: [sdb] tag#36 Sense Key : Hardware Error [current]
[ 9002.852475] sd 32:0:1:0: [sdb] tag#36 Add. Sense: Internal target failure
[ 9002.852478] sd 32:0:1:0: [sdb] tag#36 CDB: Read(10) 28 00 21 fa 09 00 00 01 00 00
[ 9002.852480] blk_update_request: critical target error, dev sdb, sector 570034432 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
[ 9003.028072] sd 32:0:1:0: [sdb] tag#37 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.028086] sd 32:0:1:0: [sdb] tag#37 Sense Key : Hardware Error [current]
[ 9003.028089] sd 32:0:1:0: [sdb] tag#37 Add. Sense: Internal target failure
[ 9003.028091] sd 32:0:1:0: [sdb] tag#37 CDB: Read(10) 28 00 21 fa 0a 00 00 01 00 00
[ 9003.028094] blk_update_request: critical target error, dev sdb, sector 570034688 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
[ 9003.294889] sd 32:0:1:0: [sdb] tag#42 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.294898] sd 32:0:1:0: [sdb] tag#42 Sense Key : Hardware Error [current]
[ 9003.294901] sd 32:0:1:0: [sdb] tag#42 Add. Sense: Internal target failure
[ 9003.294903] sd 32:0:1:0: [sdb] tag#42 CDB: Read(10) 28 00 21 fa 09 98 00 00 08 00
[ 9003.294906] blk_update_request: critical target error, dev sdb, sector 570034584 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 9003.467311] sd 32:0:1:0: [sdb] tag#43 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 9003.467324] sd 32:0:1:0: [sdb] tag#43 Sense Key : Hardware Error [current]
[ 9003.467326] sd 32:0:1:0: [sdb] tag#43 Add. Sense: Internal target failure
[ 9003.467329] sd 32:0:1:0: [sdb] tag#43 CDB: Read(10) 28 00 21 fa 09 98 00 00 08 00
[ 9003.467332] blk_update_request: critical target error, dev sdb, sector 570034584 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Will I at least be able to recover the cluster with my research data?

Just to add information here: I store the OS & Elasticsearch on a single disk and all of the data on a separate physical disk.

What about using bin/elasticsearch-node repurpose or bin/elasticsearch-node detach-cluster and then reattaching it? It's a long shot, but maybe it can help.

OK, sdb is definitely broken. If you currently have no snapshots, that might mean there are no good copies of at least one shard.

There is no way to know whether this disk will fail further, so I recommend you take a snapshot of as much data as you can right now. You might need to go index-by-index since snapshotting the broken index will likely fail. Send the snapshots to an independent system, something like S3.
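
For example, registering an S3 repository looks roughly like this (a sketch, not exact commands for your cluster: the repository name and bucket are placeholders, and the repository-s3 plugin plus keystore credentials need to be set up first):

# Register an S3 repository
curl -X PUT "localhost:9200/_snapshot/my_s3_repo?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": { "bucket": "my-es-snapshots" }
}'

# Check that every node can read and write to the repository
curl -X POST "localhost:9200/_snapshot/my_s3_repo/_verify?pretty"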


This advice makes no sense: the node does not need repurposing, and the cluster is forming fine, so detaching is unnecessary. It's a straightforward read error.

Aren't both data nodes also master nodes? Are you really running with a dedicated data node in a three-node cluster?


Thank you for that David.

I had attached an iSCSI volume to check whether I could take a snapshot. I suspect these errors are from that time; that volume was attached as sdb.

The secondary node, which has the data, runs fine without the primary node. It shows all indices in yellow except two in red, which were created yesterday. I am OK with losing these indices.

Can I run integrity validation on the indices to see which ones sit on the corrupt sectors?

Also, can I back up the cluster settings and just the indices I want (Cowrie-*) to S3, as opposed to all the indices, and restore them on the primary node?

I strongly recommend taking snapshots to an independent system like S3. Taking a snapshot to another potentially broken disk won't help.

Taking a snapshot includes integrity validation. Work index-by-index until you find the one that fails.

Yes, you can (and will need to) snapshot a single index at a time.
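
Roughly like this, one index per snapshot (a sketch; the repository and index names are placeholders):

# Snapshot a single index; this fails if that index has unreadable data on disk
curl -X PUT "localhost:9200/_snapshot/my_s3_repo/cowrie-2021.05.20?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "cowrie-2021.05.20",
  "include_global_state": false
}'

# Inspect the result (state should be SUCCESS)
curl -X GET "localhost:9200/_snapshot/my_s3_repo/cowrie-2021.05.20?pretty"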


Hello Christian:

Here are the settings from elasticsearch.yml for the secondary node (192.168.0.236) that has the data intact:

# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
#discovery.seed_hosts: ["primarynode", "secondarynode"]
#discovery.seed_hosts: ["primarynode", "secondarynode", "votingonlynode"]
discovery.seed_hosts: ["192.168.0.235", "192.168.0.236", "192.168.0.237"]
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#cluster.initial_master_nodes: ["primarynode", "secondarynode"]

# 30-05-2021: Commented out on the primarynode as part of diagnosing the sync failure, because I deleted the data disk on the primarynode. Entry noted here for consistency.
cluster.initial_master_nodes: ["192.168.0.235", "192.168.0.236"]



# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
node.master: true
node.voting_only: false

A community member asked me to carry out the following to try to restore, but it fails after writing 300 GB to the primarynode that I am trying to sync data to:

  1. Rename _state to _state_backup (anything but _state)
  2. Comment out: cluster.initial_master_nodes: ["192.168.0.235", "192.168.0.236"] and reboot the cluster.

Yes, there are two data nodes and one voting-only node in the cluster.

This did not solve the problem.

They gave you some very bad advice. You should never manipulate the contents of the data path yourself.

(The advice to remove cluster.initial_master_nodes is sound, and you should do that anyway, but I don't think it's relevant here.)


Thanks David. Thank you for the advice. In my case, path.data and path.logs were both on a separate virtual disk on the VM, which I deleted by mistake, so I reckon they were empty and waiting to be synced.

I suspect the disk errors appeared when I attached a new disk, which took up /dev/sdb.

I've run badblocks to see if I can get additional data.

Should I remove cluster.initial_master_nodes from all the nodes now, or wait for full recovery before touching those settings?

Could I ask you to have a look at this: Recent unsafe memory access operation in compiled Java code - #7 by jprante. I feel it may be the size of the indices that is causing the sync issue?

I knew it was a long shot ... thank you for taking the time to correct me.


Up to you; there's no rush, and it's nothing to do with your current problems.

No, it's a faulty disk; it could affect an index of any size. I suppose a bad block is more likely to affect a larger index just because larger indices occupy more blocks.

The post you linked seems quite confused: we don't use mmap to write anything, we never require having the whole file in memory at once anyway, and even if we did, it wouldn't result in errors like this when running out of space.


Thank you very much for the extended support David. I deeply appreciate it.

While I did not doubt you, the faulty SSD diagnosis seems to be correct! I only recently added this disk and attached it to the VM, and at that time the data synced without any issues. I should have thought of it (just a testament that I need to take a break from studies and work).

I've run badblocks on the disk and I will pass the results to fsck to help while the replacement disk comes in.
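
Something along these lines (a sketch; it assumes the data filesystem is ext4 on /dev/sdb1 and is unmounted while e2fsck runs):

# Read-only surface scan, saving the bad block numbers to a file
sudo badblocks -sv -o /root/sdb-badblocks.txt /dev/sdb

# Let e2fsck run its own bad-block scan and record the results in the filesystem
# (-c is safer than importing the badblocks list with -l, since the block sizes must match)
sudo e2fsck -fc /dev/sdb1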

I'll try to delete the larger indices I don't need, in the hope that one of them holds the bad block(s), and then try to sync the nodes again. Hopefully it will work. My research data is almost 750 GB in size, split into 15 indices. I will try to snapshot them to S3, but I am not sure that's feasible given my home internet connection's upload speed (capped at 5 Mbps).
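
For a rough sense of scale (a back-of-envelope estimate, assuming the full 5 Mbps is sustained and ignoring protocol overhead):

# 750 GB * 8 bits/byte = 6,000,000 Mbit; at 5 Mbps that is 1,200,000 s, i.e. roughly two weeks
echo "$(( 750 * 8 * 1000 / 5 / 86400 )) days"   # prints "13 days" (integer division)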

Keeping my fingers crossed.

With all the resiliency I had planned (separating the OS/application and the data onto separate physical disks), one mistake made in fatigue (deleting the virtual disk on the primary node) has caused me real pain. Exhaustion/burnout is real. This has been a lesson.

Hello @DavidTurner, is it safe to delete the following indices? They are large, and I reckon they do not hold the raw data but analysed data which I can recreate.

health status index                                                          uuid                     pri rep docs.count docs.deleted store.size pri.store.size
green  open   .monitoring-es-7-2021.05.25                                    Z87RDUcyQZqiFOiL-huaIw   1   0   10089695      1249720      3.9gb          3.9gb
green  open   .monitoring-es-7-2021.05.24                                    8yjVzoSJQkSL1YYce_tdkA   1   0    9922505      1331520      3.8gb          3.8gb
green  open   .monitoring-es-7-mb-2021.05.28                                 WecYTFQoSgeqY3L1H1gWig   1   0    6966750            0      3.8gb          3.8gb
green  open   .monitoring-es-7-2021.05.27                                    SCWv7GjkRKi51ozluCLMpA   1   0   10503113       991852      3.7gb          3.7gb
green  open   .monitoring-es-7-2021.05.26                                    CW_pa3xLTieP8sTtQhFihA   1   0    9978777      1228898      3.6gb          3.6gb
green  open   .monitoring-es-7-mb-2021.05.29                                 -gzR63-YSTm5nAAYMnawGA   1   0    7348808            0      3.4gb          3.4gb
green  open   .monitoring-es-7-mb-2021.05.26                                 kMGnzT9hTjCuzQFKQBgwtQ   1   0    5195177            0      2.8gb          2.8gb

I've deleted most of the large indices except the ones that store the raw research data. Hopefully this will work out now :slight_smile:

If it does, I will apply for leave for an entire week :smiley: and just relax.

Their names suggest that these indices contain monitoring data about your cluster. I doubt you can re-create it but it's up to you whether you need to keep it or not.

However, I strongly recommend you focus on making a copy of your important data elsewhere first. Disk failures can be progressive, so the longer you delay, the bigger the risk you will lose more data. It's possible that deleting data will cause other data to move around, potentially causing more of it to end up written to unreadable blocks. Moreover, I can't think of a mechanism by which deleting some unimportant data will help preserve any of the important data. This path seems only to have downsides.


Hello,

I'm very happy to report that the indices holding my research data are not on bad sectors.

I have successfully been able to sync the data to the primary node (and hence to the other SSD).

I am now going to take a snapshot to AWS (or Linode, since the AWS cost calculator comes to 50 USD :worried:), or to my NAS over an NFS mount (the NAS has hardware-level redundancy).

Second: I can see the cluster is still RED because a single index does not have primary or replica shards. I am sure it is lost, but I am unable to delete it. The size is 0 if I query Elasticsearch, and I cannot see it inside the Kibana GUI. Is there a way to resolve this?

red    open   .ds-ilm-history-5-2021.05.13-000004                            o2ew_zK5RGSgP-binaUbyw   1   1                                                
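
For reference, something like this should show why its shard stays unassigned (a sketch; the index name is copied from the line above):

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": ".ds-ilm-history-5-2021.05.13-000004",
  "shard": 0,
  "primary": true
}'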

I have rebuilt my dashboards and other components, which I reckon was my only choice. The lesson here: back up, and test your backups.

Once the snapshots are done, I will move the storage of Elasticsearch data to another SSD.
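
A rough outline of that move (a sketch; it assumes a systemd package install with the config in /etc/elasticsearch and the new SSD mounted at /mnt/newssd):

# Stop the node, copy the data directory to the new disk, point path.data at it, restart
sudo systemctl stop elasticsearch
sudo rsync -a /var/lib/elasticsearch/ /mnt/newssd/elasticsearch/
# edit path.data in /etc/elasticsearch/elasticsearch.yml to /mnt/newssd/elasticsearch,
# check the directory is still owned by the elasticsearch user, then:
sudo systemctl start elasticsearch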

Thank you to @DavidTurner, @Christian_Dahlqvist, and @csaltos for the extended help. Have a wonderful time ahead.

Hello,

I've mounted an NFS share, added it as a repository on the data nodes in the cluster, and checked its connectivity:

[screenshot: repository connectivity check]

Backups are, however, failing unless I keep the "Ignore unavailable indices" and "Allow partial indices" options turned on. Error:

{
  "type": "snapshot_exception",
  "reason": "[nas_backup:daily-full--dpmpzql9t2g0qla24hpxtq/eNAtkTMQSVmAeyaPcVIe1A] Indices don't have primary shards [.ds-ilm-history-5-2021.05.13-000004]",
  "stack_trace": "SnapshotException[[nas_backup:daily-full--dpmpzql9t2g0qla24hpxtq/eNAtkTMQSVmAeyaPcVIe1A] Indices don't have primary shards [.ds-ilm-history-5-2021.05.13-000004]]\n\tat org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:491)\n\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:393)\n\tat org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48)\n\tat org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691)\n\tat org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:313)\n\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:208)\n\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62)\n\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140)\n\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)\n\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:831)\n"
}

I've set up an individual policy to back up just the research data. It is ongoing; I will update on its completion status soon.
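
For reference, the API equivalent of those two Kibana toggles in an SLM policy looks roughly like this (a sketch; the policy name, schedule, and index pattern are assumptions, while the nas_backup repository name is taken from the error above):

curl -X PUT "localhost:9200/_slm/policy/research-data?pretty" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<research-{now/d}>",
  "repository": "nas_backup",
  "config": {
    "indices": ["cowrie-*"],
    "ignore_unavailable": true,
    "partial": true,
    "include_global_state": true
  }
}'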

Hello @DavidTurner, I wanted to thank you once again: the cluster is green, backed up, and on a new disk :slight_smile: All of my research data is intact and now backed up locally on a NAS (with disk-level redundancy) and in the cloud :smiley:

I wanted to ask whether it would be worthwhile for Metricbeat or another component of the Elastic Stack to keep an eye on disk errors. I know there isn't a specific module that looks at hardware health, but based on my experience it could be an excellent indicator to monitor under stack monitoring.

If you feel this is a meaningful addition, I can open a GitHub feature request. :slight_smile:

Thank you and have a wonderful week ahead :smiley:

IMO yes; see e.g. Collect SMART data with metricbeat · Issue #8614 · elastic/beats · GitHub and report disk failures · Issue #20562 · elastic/beats · GitHub. But it's far from easy, since SMART metrics are hard to read and not a very reliable indicator, and the log messages are even harder to identify.
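
Until something like that exists, smartmontools is a low-tech stopgap outside the stack (a sketch; it assumes the package is installed and the device is /dev/sdb):

# One-off SMART health summary and full attribute dump
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb
# smartd (from the same package) can watch devices continuously and send mail on failures; see /etc/smartd.conf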

