Checksum failed (hardware problem?)

I am running a simple single-node Elasticsearch instance on a Debian VM under Hyper-V on a Windows 11 host with an AMD 3950 CPU, DDR4 RAM, and an SSD. Elasticsearch is the index backend for a Graylog instance: my network equipment sends logs to Graylog via syslog, and retention in Graylog is set to cycle the index every 7 days.

Every time the index cycles I see this checksum failed error during the merge and creation of the new index.

I have to go in manually and delete the old index for the red alert to clear in Graylog, and when I track the error down through Elasticsearch I end up at this checksum failed error.
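
For reference, the manual cleanup is nothing fancy: I find the red index and delete it, roughly along these lines (index name taken from the failed cycle shown below):

# list any indices that are currently red, then drop the corrupted one
curl -X GET "localhost:9200/_cat/indices?v&health=red"
curl -X DELETE "localhost:9200/pfsense_filterlog_28?pretty"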

Here is the full output:

 curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "pfsense_filterlog_28",
  "shard": 1,
  "primary": true
}
'
{
  "index" : "pfsense_filterlog_28",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-08-13T01:52:26.419Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ktrLh1vHTAuk1iKoGbEuTQ]: shard failure, reason [merge failed], failure MergeException[org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "ktrLh1vHTAuk1iKoGbEuTQ",
      "node_name" : "debian",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "B-NovUobT_68k3UeF2gcQQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))"
            }
          }
        }
      }
    }
  ]
}
I know people usually point to hardware errors here, but I don't see any hardware errors: I have plenty of other indices running on this node that never hit this issue, and no other problems outside of Elasticsearch.

I have also checked the filesystem on boot and pre-mount, and even set the OS to do this every time the system boots, and no filesystem issues are found.
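
To be clear, by checking every boot I mean forcing the check each time, something along these lines (the device name is just an example; this assumes an ext4 root):

# run e2fsck on every mount by setting the maximum mount count to 1
tune2fs -c 1 /dev/sda1
# alternatively, force a check on the next boot via the kernel command line: fsck.mode=force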

Any idea what is happening here?

There's some information about this in the manual:

TL;DR: the exception means that the data Elasticsearch read from storage is not the data that it originally wrote there. Elasticsearch has no practical way of determining why that is.

Note the following paragraph from the docs I linked:

Data corruption typically doesn’t result in other evidence of problems apart from the checksum mismatch. Do not interpret this as an indication that your storage subsystem is working correctly and therefore that Elasticsearch itself caused the corruption. It is rare for faulty storage to show any evidence of problems apart from the data corruption, but data corruption itself is a very strong indicator that your storage subsystem is not working correctly.

How come none of my other rotating indices have issues?

How come I can't find any evidence of data corruption anywhere else on the system?

These are good questions, but the simplest way to answer them is covered in the docs I linked:

To narrow down the source of the corruptions, systematically change components in your cluster’s environment until the corruptions stop.

Data corruption bugs often need a very particular access pattern to trigger. For instance the one we blogged about a few years back was only seen after several days of indexing at full speed.
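
One component you can change purely in software is how Lucene reads the index files: a new index created with the niofs store type skips the default mmap-based access path that shows up in your stack trace. A sketch of what that looks like (this is an expert-only setting, and the index name here is just a placeholder):

curl -X PUT "localhost:9200/store-test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.store.type": "niofs"
  }
}
'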

I guess I will start with a new index and go from there. I will definitely post back when I find out what it is.

This ended up being bad RAM, discovered via bootable memtest86.

Thanks for following up here @bigjohns97, much appreciated! Bad RAM would definitely explain what you were seeing.
