I am running a simple single-node Elasticsearch instance on a Debian VM hosted on Hyper-V on a Windows 11 system with an AMD 3950 CPU, DDR4 RAM, and an SSD. Elasticsearch serves as the index backend for a Graylog instance that receives logs from my network equipment via syslog. Retention is set in Graylog to cycle the index every 7 days.
Every time the index cycles, I get a checksum failed error during the merge and the creation of the new index.
I have to manually delete the old index for the red alert in Graylog to clear, and when I trace the error through Elasticsearch I end up at this checksum failed error.
Here is the full output:
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "pfsense_filterlog_28",
  "shard": 1,
  "primary": true
}
'
{
  "index" : "pfsense_filterlog_28",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-08-13T01:52:26.419Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ktrLh1vHTAuk1iKoGbEuTQ]: shard failure, reason [merge failed], failure MergeException[org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "ktrLh1vHTAuk1iKoGbEuTQ",
      "node_name" : "debian",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "B-NovUobT_68k3UeF2gcQQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=3a7148ca actual=b0e6e03 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/fS9fQIj0RjioMdvD0t4BuQ/1/index/_48yy.cfs\") [slice=_48yy_Lucene84_0.tim]))"
            }
          }
        }
      }
    }
  ]
}
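The output above names the same segment file several times. To pull that path out of a saved copy of the response, something like the following might help. This is only a sketch: the function name corrupt_paths and the file name explain.json are hypothetical, and it assumes the response was written to a file first (for example by adding -o explain.json to the curl command).

```shell
# Hypothetical helper: print the unique segment file paths named in a
# saved allocation-explain response (the file name is an assumption).
corrupt_paths() {
  # Match path="..." or path=\"...\" and keep only the path itself.
  grep -o 'path=[\\"]*[^\\"]*' "$1" | sed 's/^path=[\\"]*//' | sort -u
}
```

Calling corrupt_paths explain.json on the response shown here should print the single _48yy.cfs path, which makes it easy to compare against the file reported on the next cycle.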
I know people usually point to hardware errors here, but I don't see any hardware errors. I have plenty of other indices on this node that never have this issue, and there are no other problems outside of Elasticsearch.
I have also checked the filesystem at boot, before mounting, and even set the OS to run this check on every boot; no filesystem issues are found.
Any idea what is happening here?