Checksum failed (hardware problem?)

I am running a simple single-node Elasticsearch instance on a Debian VM under Hyper-V on a Windows 11 host with an AMD 3950 CPU, DDR4 RAM, and an SSD. Elasticsearch is the index backend for a Graylog instance: my network equipment sends logs to Graylog via syslog, and retention in Graylog is set to cycle the index every 7 days.

Every time the index cycles I see this checksum failed error during the merge and creation of the new index.

I have to go in manually and delete the old index for the red alert to clear in Graylog, and when I track the error down through Elasticsearch I end up at this checksum failed error.
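
For reference, the manual cleanup is nothing fancy: I find the red index and delete it, roughly along these lines (index name taken from the failed cycle shown below):

# list any indices that are currently red, then drop the corrupted one
curl -X GET "localhost:9200/_cat/indices?v&health=red"
curl -X DELETE "localhost:9200/pfsense_filterlog_28?pretty"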

Here is the full output:

 curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "pfsense_filterlog_28",
  "shard": 1,
  "primary": true
}
'
{
  "index" : "pfsense_filterlog_28",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-08-13T01:52:26.419Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ktrLh1vHTAuk1iKoGbEuTQ]: shard failure, reason [merge failed], failure MergeException[org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "ktrLh1vHTAuk1iKoGbEuTQ",
      "node_name" : "debian",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "B-NovUobT_68k3UeF2gcQQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))"
            }
          }
        }
      }
    }
  ]
}
I know people usually point to hardware errors here, but I don't see any hardware errors: I have plenty of other indices running on this node that never hit this issue, and no other problems outside of Elasticsearch.

I have also checked the filesystem on boot and pre-mount, and even set the OS to do this every time the system boots, and no filesystem issues are found.
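
To be clear, by checking every boot I mean forcing the check each time, something along these lines (the device name is just an example; this assumes an ext4 root):

# run e2fsck on every mount by setting the maximum mount count to 1
tune2fs -c 1 /dev/sda1
# alternatively, force a check on the next boot via the kernel command line: fsck.mode=force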

Any idea what is happening here?

There's some information about this in the manual:

TL;DR: the exception means that the data Elasticsearch read from storage is not the data that it originally wrote there. Elasticsearch has no practical way of determining why that is.

Note the following paragraph from the docs I linked:

Data corruption typically doesn’t result in other evidence of problems apart from the checksum mismatch. Do not interpret this as an indication that your storage subsystem is working correctly and therefore that Elasticsearch itself caused the corruption. It is rare for faulty storage to show any evidence of problems apart from the data corruption, but data corruption itself is a very strong indicator that your storage subsystem is not working correctly.

How come none of my other rotating indices have issues?

How come I can't find any evidence of data corruption anywhere else on the system?

These are good questions, but the simplest way to answer them is covered in the docs I linked:

To narrow down the source of the corruptions, systematically change components in your cluster’s environment until the corruptions stop.

Data corruption bugs often need a very particular access pattern to trigger. For instance the one we blogged about a few years back was only seen after several days of indexing at full speed.
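
One component you can change purely in software is how Lucene reads the index files: a new index created with the niofs store type skips the default mmap-based access path that shows up in your stack trace. A sketch of what that looks like (this is an expert-only setting, and the index name here is just a placeholder):

curl -X PUT "localhost:9200/store-test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.store.type": "niofs"
  }
}
'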

I guess I will start with a new index and go from there. I will definitely post back when I find out what it is.

This ended up being bad RAM, discovered via bootable memtest86.

Thanks for following up here @bigjohns97, much appreciated! Bad RAM would definitely explain what you were seeing.
