Frequent shard failures

Hi,

We are running a single-node Elasticsearch cluster on Elastic Stack version 7.17.9. The node runs on a machine with SSD storage and 128 GB RAM. When I index data, indexing stops after a while: the Elasticsearch service is still running, but my documents can no longer be indexed and I also cannot see the previously ingested data. When I check the index status in Stack Management it shows as red, and when I look at the Elasticsearch logs I see the following:

[2024-12-27T15:00:00,356][WARN ][o.e.t.ThreadPool         ] [DESKTOP-NHDT04G] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@56e6384c] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:877) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:891) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.getFlushingBytes(IndexWriter.java:781) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.elasticsearch.index.engine.InternalEngine.getWritingBytes(InternalEngine.java:649) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.getWritingBytes(IndexShard.java:1296) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:184) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:312) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:292) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:444) [elasticsearch-7.17.9.jar:7.17.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 66192
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:207) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:230) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:75) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:116) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:165) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:186) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:974) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:527) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:491) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1444) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.elasticsearch.index.engine.InternalEngine.addDocs(InternalEngine.java:1310) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1248) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1051) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1066) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:998) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:900) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:320) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:181) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:245) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:134) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:196) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9]

I worked around this problem by running

elasticsearch-shard remove-corrupted-data --index mails --shard-id 0 --truncate-clean-translog -v

This command fixes the problem, but after some 2-3 hours I face the same problem again. How can I find the cause of this issue, and what is causing it? Please help urgently.
I went through Troubleshooting corruption | Elasticsearch Guide [8.17] | Elastic,
but I did not understand how to find the actual cause. What hardware architecture would you recommend for running Elasticsearch so that it does not produce these issues, or do I have to upgrade Elasticsearch (with backward compatibility)?
Please help me, this is a serious problem for me and I have to solve it as soon as possible.

Exactly what type of SSD storage do you have? Is it a local disk mounted on a physical machine?

I would also recommend you upgrade to at least the latest 7.17 release.

Yes, it is a local disk mounted on a physical machine (model name: WD Blue SN5000 4TB). So far a total of 1 TB of data has been indexed.

Does upgrading solve this issue permanently? With the upgrade we also need to make changes in our code, so it will take some time. After that work is done, will it really solve my issue?

You should not need to change any code to upgrade to the latest 7.17 release. I do not know whether any issues related to this have been fixed in newer versions, but upgrading will eliminate that as a potential cause. I suspect that there may be something wrong with your storage, and if that is the case upgrading will naturally not help. Have you checked your disk for issues?
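
If not, a couple of built-in checks could be a starting point (purely as examples; D: is a placeholder for whichever drive holds your Elasticsearch data, and the second line should be run from an elevated PowerShell prompt):

chkdsk D: /scan
Get-PhysicalDisk | Format-Table FriendlyName, MediaType, HealthStatus, OperationalStatus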

Do you have anything else on your server that may be messing with the Elasticsearch data files, like some anti-virus or anything like that?

This behavior is not normal. I would suggest that something else is messing with your files or that you are starting to have a hardware failure.

Please check the system logs for IO-related errors, e.g. /var/log/messages and similar log files. Output from "dmesg -T" might also be helpful, assuming a Linux system of some sort. If this is any kind of Windows, then ... the risk of something else interfering with your data files is just a lot more significant, e.g. some anti-virus/anti-malware software doing something unhelpful.
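
Something along these lines, for example (the grep patterns are just a rough filter; the last line is the rough Windows equivalent, pulling recent errors and warnings from the System event log via PowerShell):

dmesg -T | grep -iE 'error|fail|i/o|ata|nvme'
journalctl -k --since "1 hour ago" | grep -iE 'error|fail|ata|nvme'
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2,3} -MaxEvents 100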

From the info provided it seems your storage is not stable, so now would be a good time to make a snapshot/backup of your indices, e.g. attach another disk or use network storage, unless it's trivial to re-index everything you have indexed already.
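
A minimal sketch of a filesystem snapshot repository plus a first snapshot (repository name, snapshot name and location are placeholders; the location must point at another disk or network share and must also be listed under path.repo in elasticsearch.yml, and you can paste the same requests into Kibana Dev Tools if curl is awkward on Windows):

curl -X PUT "localhost:9200/_snapshot/local_backup" -H 'Content-Type: application/json' -d '{"type":"fs","settings":{"location":"/path/on/another/disk"}}'
curl -X PUT "localhost:9200/_snapshot/local_backup/snapshot-1?wait_for_completion=true"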

Obviously, a single-node Elasticsearch cluster is not ideal in terms of resilience, so I hope this is a test/POC system and not a production/mission-critical environment.

There is only Microsoft Defender on the server. If hardware failure is starting, how do I check which hardware component is causing this issue? How do I check these things? I have tested all the hardware and it passes those tests, and this physical setup is only 2 months old. Is there any other way to check this, or any other possible reason for this issue? Please help me, sir.

Is this Windows machine solely and purely used for running Elasticsearch? Half-joking, but have you considered not running Windows on this server? Linux is ... better.

A "hardware issue" is just the current best suggestion based on the limited information you shared. Share some more details on how the problem presents itself, what you are doing, and your config, and maybe some other theory will appear.
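
For example, the output of something like the following (adjust host and credentials to your setup) would already show which shard is failing and why it ends up unassigned:

curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
curl -s "localhost:9200/_cluster/allocation/explain?pretty"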

As pointed out, if anything starts messing around with the files on the data partition, this can corrupt them. So check the system logs, the Microsoft Defender logs, and look for anything around storage/files.
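
For example, from an elevated PowerShell prompt (the data path is a placeholder for wherever path.data points; the exclusion is optional, but it rules Defender out as the thing touching the files):

Get-WinEvent -LogName "Microsoft-Windows-Windows Defender/Operational" -MaxEvents 50
Add-MpPreference -ExclusionPath "D:\elasticsearch\data"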

Given what I know now, I'd strongly suggest you check the S.M.A.R.T. data of your storage on a periodic basis. There are many tools for this; as I'm not a Windows fan I cannot recommend a specific one.
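
Purely as examples rather than a recommendation: Windows can report a drive's own reliability counters without extra software, and smartmontools also has a Windows build (the device name may differ on your system).

Get-PhysicalDisk | Get-StorageReliabilityCounter | Format-List
smartctl -a /dev/sda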

If you happen to have another server, maybe consider duplicating your environment and seeing if the problem shows up there too.

And again, now is a really good time to make backups/snapshots!