Frequent shard failures

Hi,

We are running a 3-node Elasticsearch cluster on Elastic Stack version 8.8.1. The nodes run on three identical computers, each with SSD storage and 16 GB RAM. The cluster is used to index firewall logs.

Node 1: es-node120
IP: 192.168.1.120

Node 2: es-node121
IP: 192.168.1.121

Node 3: es-node122
IP: 192.168.1.122

On certain days, the index gets corrupted and fails. It happened again today; here is an extract of the logs from the master node (es-node121) during that period:

[2023-06-19T10:01:00,793][WARN ][o.e.c.r.a.AllocationService] [es-node121] failing shard [FailedShard[routingEntry=[firewall-2023.06.19][0], node[3-jgQcnUSQueHzNULoWB5g], [R], s[STARTED], a[id=o70C-OxiTSSQTd2YASa01Q], failed_attempts[0], message=shard failure, reason [already closed by tragic event on the index writer], failure=org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt"))), markAsStale=true]]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1623) ~[?:?]

The logs on the data node (es-node122) during the same period:

[2023-06-19T10:00:58,070][WARN ][o.e.t.ThreadPool         ] [es-node122] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@d4b9692] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:908) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:921) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.getFlushingBytes(IndexWriter.java:795) ~[lucene-core-9.6.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.getWritingBytes(InternalEngine.java:667) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard.getWritingBytes(IndexShard.java:1243) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:183) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:311) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:291) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:442) ~[elasticsearch-8.8.0.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1623) ~[?:?]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
        ... 3 more
[2023-06-19T10:00:58,107][WARN ][o.e.i.s.IndexShard       ] [es-node122] [firewall-2023.06.19][0] failed to flush index
org.elasticsearch.index.engine.FlushFailedEngineException: Flush failed
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2064) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1623) ~[?:?]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
        ... 7 more
[2023-06-19T10:00:58,107][WARN ][o.e.i.e.Engine           ] [es-node122] [firewall-2023.06.19][0] failed engine [already closed by tragic event on the index writer]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1f4cc8c4 actual=3a847655 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/indices/ZMBufdoNQrCXPYHhX6ZGCQ/0/index/_1w1.fdt")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:153) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5742) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:546) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:474) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:492) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:671) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3608) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4043) ~[lucene-core-9.6.0.jar:?]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4005) ~[lucene-core-9.6.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2709) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2052) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1384) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.index.shard.IndexShard$6.doRun(IndexShard.java:3663) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.0.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1623) ~[?:?]

There are no logs in the second data node (es-node120) for the period under observation.

The output of

GET firewall-2023.06.19/_search

is

{
  "error": {
    "root_cause": [
      {
        "type": "no_shard_available_action_exception",
        "reason": null
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "firewall-2023.06.19",
        "node": null,
        "reason": {
          "type": "no_shard_available_action_exception",
          "reason": null
        }
      }
    ]
  },
  "status": 503
}
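For completeness, the cluster allocation explain API should report why the shard copy cannot be assigned. A minimal request for the primary of shard 0 of this index would look like:

GET _cluster/allocation/explain
{
  "index": "firewall-2023.06.19",
  "shard": 0,
  "primary": true
}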

Is there any solution to this that doesn't involve loss of data?

Which node is this? Is it es-node122?

Do you have logs from the other days? Could you check whether similar information is present, and which node ID appears in those logs?

You may have a hardware issue in one of the nodes.
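To map the node[3-jgQcnUSQueHzNULoWB5g] ID from the log back to a node name, and to see which shard copies failed and why they are unassigned, requests along these lines should help (the column selection here is just one reasonable choice):

GET _cat/nodes?v&full_id=true&h=id,name,ip

GET _cat/shards/firewall-2023.06.19?v&h=index,shard,prirep,state,node,unassigned.reason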

Yes, node[3-jgQcnUSQueHzNULoWB5g] is es-node122.

We have faced similar issues in the past with this particular node.

You may have a hardware issue in one of the nodes.

I am assuming that the suspect here is es-node122. What kind of hardware faults should we look out for?

We have run RAM and hard drive tests, and both completed successfully without any faults being detected.
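For anyone who wants to suggest further checks: SMART self-tests and the kernel log are the usual next things to look at, roughly along these lines (assuming the data disk is /dev/sda; adjust for your device):

sudo smartctl -a /dev/sda            # overall health, reallocated/pending sectors, CRC error counters
sudo smartctl -t long /dev/sda       # start a long self-test; re-run "smartctl -a" later for the result
sudo dmesg | grep -iE 'i/o error|ata|nvme'   # any I/O errors reported by the kernel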

Edit:

Each time this happens, a JVM crash log /var/log/elasticsearch/hs_err_pidNNNNN.log is created. The contents of the file created today are:

cat /var/log/elasticsearch/hs_err_pid90507.log

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc8979b072b, pid=90507, tid=90520
#
# JRE version: OpenJDK Runtime Environment (20.0.1+9) (build 20.0.1+9-29)
# Java VM: OpenJDK 64-Bit Server VM (20.0.1+9-29, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x7b072b]  G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x1d2b
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

---------------  S U M M A R Y ------------

Command Line: -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-17445613721836697659 -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms7830m -Xmx7830m -XX:MaxDirectMemorySize=4106223616 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 -Des.distribution.type=deb --module-path=/usr/share/elasticsearch/lib --add-modules=jdk.net --add-modules=org.elasticsearch.preallocate -Djdk.module.main=org.elasticsearch.server org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch

Host: 12th Gen Intel(R) Core(TM) i7-12700T, 20 cores, 15G, Debian GNU/Linux 11 (bullseye)
Time: Mon Jun 19 12:02:46 2023 IST elapsed time: 26.873717 seconds (0d 0h 0m 26s)

---------------  T H R E A D  ---------------

Current thread (0x00007fc8900b7180):  WorkerThread "GC Thread#8" [stack: 0x00007fc8608e3000,0x00007fc8609e3000] [id=90520]

Stack: [0x00007fc8608e3000,0x00007fc8609e3000],  sp=0x00007fc8609e1bb0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x7b072b]  G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x1d2b
V  [libjvm.so+0x7ec4a2]  G1ParEvacuateFollowersClosure::do_void()+0x52
V  [libjvm.so+0x7ecd2b]  G1EvacuateRegionsTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int)+0x8b
V  [libjvm.so+0x7ea5ca]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x9a
V  [libjvm.so+0xf0dea0]  WorkerThread::run()+0x80
V  [libjvm.so+0xe598e6]  Thread::call_run()+0xa6
V  [libjvm.so+0xc895c8]  thread_native_entry(Thread*)+0xd8

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000100

Registers:
RAX=0x0000000000000000, RBX=0x00007fc83c000c10, RCX=0x0000000000000025, RDX=0x0000000000000001
RSP=0x00007fc8609e1bb0, RBP=0x00007fc8609e1c40, RSI=0x00007fc8984eafe0, RDI=0x00007fc8975f18a0
R8 =0x00000008010afe68, R9 =0x0000000000000000, R10=0x00000007da50f610, R11=0x0000000000000001
R12=0x00000007da50f625, R13=0x00000007f08bc5d8, R14=0x00007fc8985320d5, R15=0x0000000000000001
RIP=0x00007fc8979b072b, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e


Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007fc83c000c10 points into unknown readable memory: 0x00007fc898428418 | 18 84 42 98 c8 7f 00 00
RCX=0x0000000000000025 is an unknown value
RDX=0x0000000000000001 is an unknown value
RSP=0x00007fc8609e1bb0 points into unknown readable memory: 0x00007fc800000000 | 00 00 00 00 c8 7f 00 00
RBP=0x00007fc8609e1c40 points into unknown readable memory: 0x00007fc8609e1d00 | 00 1d 9e 60 c8 7f 00 00
RSI=0x00007fc8984eafe0: <offset 0x00000000012eafe0> in /usr/share/elasticsearch/jdk/lib/server/libjvm.so at 0x00007fc897200000
RDI=0x00007fc8975f18a0: <offset 0x00000000003f18a0> in /usr/share/elasticsearch/jdk/lib/server/libjvm.so at 0x00007fc897200000
R8 =0x00000008010afe68 is pointing into metadata
R9 =0x0 is NULL
R10=0x00000007da50f610 is an oop: org.apache.lucene.util.BytesRef 
{0x00000007da50f610} - klass: 'org/apache/lucene/util/BytesRef'
 - ---- fields (total size 3 words):
 - public 'offset' 'I' @12  0 (0x00000000)
 - public 'length' 'I' @16  9 (0x00000009)
 - public 'bytes' '[B' @20  
[error occurred during error reporting (printing register info), id 0xb, SIGSEGV (0xb) at pc=0x00007fc897e71b82]
Top of Stack: (sp=0x00007fc8609e1bb0)
0x00007fc8609e1bb0:   00007fc800000000 0000000000000001
0x00007fc8609e1bc0:   0000000000000009 00007fc8906f5b70
0x00007fc8609e1bd0:   0000000700000000 0000000800003258
0x00007fc8609e1be0:   00000007da50f600 00007fc8984eaff0
0x00007fc8609e1bf0:   000000000000000f 00000007da50f610
0x00007fc8609e1c00:   00017fc8609e1d70 0000000000000001
0x00007fc8609e1c10:   00007fc8609e1d70 00007fc8609e1d30
0x00007fc8609e1c20:   00007fc898536ca0 000000000000000b
0x00007fc8609e1c30:   00007fc83c000c10 00000000000003d8
0x00007fc8609e1c40:   00007fc8609e1d00 00007fc8979ec4a2
0x00007fc8609e1c50:   00007fc8609e1ce0 000000000000007d
0x00007fc8609e1c60:   00007fc8609e1cd0 00007fc89795ba03
0x00007fc8609e1c70:   0000000000000000 0000000000000000
0x00007fc8609e1c80:   00007ffff8000000 00007fc81bf2e130
0x00007fc8609e1c90:   00007fc8609e1cd0 000000000000000b
0x00007fc8609e1ca0:   0000000000000001 00007fc898729081
0x00007fc8609e1cb0:   000000000000000b 00007fc8609e1ce0
0x00007fc8609e1cc0:   00007fc814001030 00007fc897e93c36
0x00007fc8609e1cd0:   000000000008e8d8 00007fc81bf2e130
0x00007fc8609e1ce0:   00007fc814001030 000000000000000b
0x00007fc8609e1cf0:   000000000000000b 00000000000003d8
0x00007fc8609e1d00:   00007fc8609e1da0 00007fc8979ecd2b
0x00007fc8609e1d10:   0000000640df2130 0000000000000000
0x00007fc8609e1d20:   00007fc83c000c10 0000000000000000
0x00007fc8609e1d30:   00007fc898428990 0000000000000000
0x00007fc8609e1d40:   0000000000000000 0000000000000000
0x00007fc8609e1d50:   00007fc890042930 00007fc83c000c10
0x00007fc8609e1d60:   00007fc89004b1e0 00007fc81bf2e160
0x00007fc8609e1d70:   00007fc800000013 00007fc8900b7450
0x00007fc8609e1d80:   00007fc81bf2e130 00007fc8900b7450
0x00007fc8609e1d90:   00007fc81bf2e130 00007fc8900b7490
0x00007fc8609e1da0:   00007fc8609e1e10 00007fc8979ea5ca 

Instructions: (pc=0x00007fc8979b072b)
0x00007fc8979b062b:   78 01 00 00 44 0f 43 f8 48 8b 73 08 44 88 7d c6
0x00007fc8979b063b:   88 55 c7 4c 8d 3d af 2d b8 00 8b 8e 48 02 00 00
0x00007fc8979b064b:   48 8b 86 40 02 00 00 48 d3 e0 48 89 c1 4c 89 e8
0x00007fc8979b065b:   48 29 c8 41 8b 0f 48 d3 e8 48 8b 8e 28 02 00 00
0x00007fc8979b066b:   89 c0 48 8b 04 c1 8b b8 98 00 00 00 48 89 45 98
0x00007fc8979b067b:   48 8b 43 70 89 7d a0 84 d2 0f 85 3b 0f 00 00 48
0x00007fc8979b068b:   8b 48 10 89 fa 48 8b 14 d1 4c 8b 4a 30 48 8b 42
0x00007fc8979b069b:   38 4c 29 c8 48 c1 e8 03 4c 39 d8 0f 82 5e 0e 00
0x00007fc8979b06ab:   00 4b 8d 04 d9 48 89 42 30 4d 85 c9 0f 84 4d 0e
0x00007fc8979b06bb:   00 00 48 8d 05 5c b5 b3 00 48 8b 00 41 0f 18 0c
0x00007fc8979b06cb:   01 49 83 fb 08 0f 87 90 1f 00 00 48 8d 3d 97 b2
0x00007fc8979b06db:   86 00 4a 63 04 9f 48 01 f8 ff e0 4c 8d 35 e8 19
0x00007fc8979b06eb:   b8 00 41 0f b6 16 84 d2 0f 84 39 07 00 00 48 8d
0x00007fc8979b06fb:   35 e0 a8 b3 00 41 8b 45 08 8b 4e 08 48 d3 e0 48
0x00007fc8979b070b:   03 06 48 89 45 98 8b 48 08 85 c9 0f 8e cc 0f 00
0x00007fc8979b071b:   00 f6 c1 01 74 1a 48 8b 00 48 8d 3d 75 11 c4 ff
0x00007fc8979b072b:   48 8b 80 00 01 00 00 48 39 f8 0f 85 01 18 00 00
0x00007fc8979b073b:   c1 f9 03 4c 63 d1 80 7d a0 00 4c 89 5d c8 0f 85
0x00007fc8979b074b:   e7 0e 00 00 41 f6 c3 01 0f 84 ec 1d 00 00 4c 89
0x00007fc8979b075b:   d8 48 c1 e8 03 41 89 c0 41 83 e0 0f 44 3b 83 78
0x00007fc8979b076b:   01 00 00 19 d2 31 c0 83 c2 01 44 3b 83 78 01 00
0x00007fc8979b077b:   00 44 0f 43 f8 48 8b 73 08 44 88 7d c6 88 55 c7
0x00007fc8979b078b:   4c 8d 3d 62 2c b8 00 8b 8e 48 02 00 00 48 8b 86
0x00007fc8979b079b:   40 02 00 00 48 d3 e0 48 89 c1 4c 89 e8 48 29 c8
0x00007fc8979b07ab:   41 8b 0f 48 d3 e8 48 8b 8e 28 02 00 00 89 c0 48
0x00007fc8979b07bb:   8b 04 c1 8b b8 98 00 00 00 48 89 45 88 48 8b 43
0x00007fc8979b07cb:   70 89 7d 90 84 d2 0f 85 53 0e 00 00 48 8b 48 10
0x00007fc8979b07db:   89 fa 48 8b 14 d1 4c 8b 4a 30 48 8b 42 38 4c 29
0x00007fc8979b07eb:   c8 4c 89 4d b8 48 c1 e8 03 4c 39 d0 0f 82 e4 0d
0x00007fc8979b07fb:   00 00 4b 8d 04 d1 48 89 42 30 4d 85 c9 0f 84 d3
0x00007fc8979b080b:   0d 00 00 48 8d 05 0b b4 b3 00 48 8b 00 41 0f 18
0x00007fc8979b081b:   0c 01 49 83 fa 08 0f 87 e7 21 00 00 48 8d 15 6a 


Stack slot to memory mapping:
stack at sp + 0 slots: 0x00007fc800000000 points into unknown readable memory: 0x00007fc800000020 | 20 00 00 00 c8 7f 00 00
stack at sp + 1 slots: 0x0000000000000001 is an unknown value
stack at sp + 2 slots: 0x0000000000000009 is an unknown value
stack at sp + 3 slots: 0x00007fc8906f5b70 points into unknown readable memory: 0x00000007f0800000 | 00 00 80 f0 07 00 00 00
stack at sp + 4 slots: 0x0000000700000000 points into unknown readable memory: 0x0000000000000000 | 00 00 00 00 00 00 00 00
stack at sp + 5 slots: 0x0000000800003258 is pointing into metadata
stack at sp + 6 slots: 
[error occurred during error reporting (inspecting top of stack), id 0xb, SIGSEGV (0xb) at pc=0x00007fc89795bcce]



***A lot of data which exceeds the post length limit here.***

END.

Really it could be anything, but the Elasticsearch manual's guidance on troubleshooting index corruption is a good starting point.

From those docs:

Data corruption typically doesn’t result in other evidence of problems apart from the checksum mismatch. Do not interpret this as an indication that your storage subsystem is working correctly and therefore that Elasticsearch itself caused the corruption. It is rare for faulty storage to show any evidence of problems apart from the data corruption, but data corruption itself is a very strong indicator that your storage subsystem is not working correctly.


What role does the replica shard play when index corruption takes place?

Is there any configuration setting that can be used to restore the index from the replica shard?

In this case, it is a 3-node cluster, with each node on a distinct physical computer.
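For reference, the replica count on the index can be checked (and changed) through the index settings API; a minimal sketch, assuming the index name above:

GET firewall-2023.06.19/_settings/index.number_of_replicas

PUT firewall-2023.06.19/_settings
{
  "index": { "number_of_replicas": 1 }
}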

If it were possible to recover safely from the replica then Elasticsearch would do so automatically.

If this is happening frequently then you have a serious problem on one of your nodes. You must fix this first.
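If you want to keep shard copies away from the suspect machine while you investigate, allocation filtering is one option, and once the hardware is sorted out you can ask the cluster to retry the failed allocations. A sketch, assuming es-node122 is the node in question:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "es-node122"
  }
}

POST _cluster/reroute?retry_failed=true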

Looks like a change of hardware is in order!

Thank you 🙂
