Hi,
We are running a three-node Elasticsearch cluster on Elastic Stack version 8.8.1. The nodes run on three identical computers, each with SSD storage and 16 GB of RAM. The setup is used to index firewall logs.
Node 1: es-node120
IP: 192.168.1.120
Node 2: es-node121
IP: 192.168.1.121
Node 3: es-node122
IP: 192.168.1.122
On certain days, the index gets corrupted and fails. This happened today. Here is the extract of the logs from the master node (es-node121) during that period:
[2023-06-19T10:01:00,793][WARN ][o.e.c.r.a.AllocationService] [es-node121] failing shard [FailedShard[routingEntry=[firewall-2023.06.19][0], node[3-jgQcnUSQueHzNULoWB5g], [R], s[STARTED], a[id=o70C-OxiTSSQTd2YASa01Q], failed_attempts[0], message=shard failure, reason [already closed by tragic event on the index writer], failure=org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt"))), markAsStale=true]]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1623) ~[?:?]
The logs on the data node (es-node122) during the same period:
[2023-06-19T10:00:58,070][WARN ][o.e.t.ThreadPool ] [es-node122] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@d4b9692] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:908) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:921) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.getFlushingBytes(IndexWriter.java:795) ~[lucene-core-9.6.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.getWritingBytes(InternalEngine.java:667) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.getWritingBytes(IndexShard.java:1243) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:183) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:311) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:291) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:442) ~[elasticsearch-8.8.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1623) ~[?:?]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
... 3 more
[2023-06-19T10:00:58,107][WARN ][o.e.i.s.IndexShard ] [es-node122] [firewall-2023.06.19][0] failed to flush index
org.elasticsearch.index.engine.FlushFailedEngineException: Flush failed
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2064) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1623) ~[?:?]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
... 7 more
[2023-06-19T10:00:58,107][WARN ][o.e.i.e.Engine ] [es-node122] [firewall-2023.06.19][0] failed engine [already closed by tragic event on the index writer]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1623) ~[?:?]
There are no logs in the second data node (es-node120) for the period under observation.
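Every trace above reports the same checksum failure: the same expected and actual values for the same file. For anyone who wants to compare entries across nodes, here is a minimal Python sketch that pulls those fields out of a log message. The function name parse_checksum_failure and the regular expression are my own assumptions, based only on the message format shown in these logs; they are not part of Elasticsearch or Lucene.

```python
import re

# Assumed pattern, derived only from the CorruptIndexException text in the logs above.
CHECKSUM_RE = re.compile(
    r'checksum failed \(hardware problem\?\) : '
    r'expected=(?P<expected>[0-9a-fA-F]+) actual=(?P<actual>[0-9a-fA-F]+) '
    r'\(resource=.*?path="(?P<path>[^"]+)"'
)


def parse_checksum_failure(message):
    """Return a dict with 'expected', 'actual' and 'path', or None if the message does not match."""
    match = CHECKSUM_RE.search(message)
    if match is None:
        return None
    return match.groupdict()


# Message copied from the log extract above.
sample = (
    'org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : '
    'expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput('
    'path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))'
)

info = parse_checksum_failure(sample)
print(info)
```

Running this over the log files of each node would show whether the same file and checksum pair is involved on every occurrence.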
The output of
GET firewall-2023.06.19/_search
is
{
  "error": {
    "root_cause": [
      {
        "type": "no_shard_available_action_exception",
        "reason": null
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "firewall-2023.06.19",
        "node": null,
        "reason": {
          "type": "no_shard_available_action_exception",
          "reason": null
        }
      }
    ]
  },
  "status": 503
}
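To summarize which shards failed in a response like this one, a small Python sketch could look as follows. The helper name failed_shards is hypothetical, and the sample is a trimmed copy of the response above, parsed with the standard json module.

```python
import json

# Trimmed copy of the search error response shown above.
response_text = """
{
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "failed_shards": [
      {
        "shard": 0,
        "index": "firewall-2023.06.19",
        "node": null,
        "reason": {"type": "no_shard_available_action_exception", "reason": null}
      }
    ]
  },
  "status": 503
}
"""


def failed_shards(response):
    """Return (index, shard, node, reason type) for each failed shard in an error response."""
    error = response.get("error", {})
    return [
        (entry["index"], entry["shard"], entry["node"], entry["reason"]["type"])
        for entry in error.get("failed_shards", [])
    ]


summary = failed_shards(json.loads(response_text))
for index, shard, node, reason in summary:
    print(f"{index} shard {shard} on node {node}: {reason}")
```

Here the node is null, which matches the error type: no copy of shard 0 is currently available to serve the search.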
Is there any solution to this that doesn't involve loss of data?