ElasticSearch freezes suddenly and unable to respond

Vishal_Goel · September 15, 2021, 7:45am

Hi,

We have ES OSS 7.0 running for more than couple of months. In last 15 days, ES froze and became very slow. It is almost unable to process any request.

Setup:
Single Node, Windows VM, 16 cores, 32GB RAM, ES Xmx 4G, Total ES indices data ~2GB.

Symptoms:
During both of the occurences, we noticed below:

ES logs start showing slowness symptom with following lines
failed to update shard information for clusterinfoupdatejob within 15s timeout
failed to update node information for clusterinfoupdatejob within 15s timeout
After sometime, index requests start getting timed out. We use ES high level rest client which starts getting timed out after 30 seconds.
After sometime of this, ES start rejecting request as its request_queue size becomes full (200).

We did monitor the machine and ES at the time of issue:
ES -> All shards are yellow or green. No red shards. Querying /hot_threads api gets timed out. Even sometime, node statistics are timed out or unresponsive. Checking node statistics data, we did not find anything alarming except rejection count.

Windows -> We don't see any high pressure on cpu, memory. Disk I/O is also quite OK. Nothing extraordinary here, except some antivirus running.

Logs:

[WARN ][o.e.c.InternalClusterInfoService] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[ ERROR][o.e.a.b.TransportBulkAction] failed to execute pipeline for a bulk request

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of [org.elasticsearch.ingest.IngestService$4@13da6e35](mailto:org.elasticsearch.ingest.IngestService$4@13da6e35) on EsThreadPoolExecutor[name = redacted/write, queue capacity = 200, [org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bed0d28[Running](mailto:org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bed0d28[Running), pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 1158236]]

We are unable to get around what can cause this issue. It seems that ES is stuck or very slow due to some reason and unable to recover even after 2-3 days.

/_cluster/stats ouutput at the time of issue: ES _/cluster/stats - Pastebin.com

How can we further debug on this issue?

Christian_Dahlqvist · September 15, 2021, 9:07am

How much data do you have in the cluster? How many indices and shards do you have in the cluster?

Vishal_Goel · September 15, 2021, 9:26am

Hi,

Total data <3GB
# of Indices -> 15, one shard per index
One node in a cluster.

warkolm · September 15, 2021, 10:09pm

7.0 is long past EOL, please upgrade ASAP.

What is the output from the _cluster/stats?pretty&human API?

Vishal_Goel · September 16, 2021, 4:09am

Hi @MarkWalkom, I have updated the question with _cluster/stats o/p. I don't see anything out of oridnary here except the rejected count for indexing. We are planning to upgrade to ES7.10 but want to make sure that we understand the issue first.

warkolm · September 16, 2021, 4:14am

You would be far better off upgrading to 7.14, as it's latest.

Vishal_Goel · September 16, 2021, 4:28am

Ok, we will check on that.
Anyway, any insight on this issue?

dadoonet · September 16, 2021, 5:03am

So many things have been fixed since 7.0. I'd upgrade and see if this fixes the problem you are seeing.

Christian_Dahlqvist · September 16, 2021, 5:10am

I would recommend temporarily disabling antivirus to see if that has any impact. Have seen some types of antivirus have huge impact on write performance in the past and indexing is quite I/O intensive.

Vishal_Goel · September 17, 2021, 7:44am

Ok. We will check after upgrading to latest ES version.
At the time of issue, I can see from windows event log, antivirus was running. And antivirus run for 3-4 hours and then stopped.
Is it possible that it permanently slowed down ES and ES could not recover fom it even after 2-3 days?
We need to restart ES to solve the issue.

Vishal_Goel · October 8, 2021, 1:52am

Hi All,

We did not yet upgrade ES version (due to some other issues, still 7.0) and the issue popped up again.
This time, we could capture the hot_threads output.
Can we get any idea from it, what went wrong here. From the face of it, seems like threads are stuck at Disk I/O. Just note that this ES server is running on VMWare Windows VM.

Any help would be very much appreciated.

Hot threads at 2021-10-06T16:01:16.482Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    9.4% (46.8ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][snapshot][T#16]'
     10/10 snapshots sharing following 19 elements
       sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
       sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
       sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
       sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
       java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
       java.nio.file.Files.newOutputStream(Files.java:216)
       org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:132)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1335)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1274)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:846)
       org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:368)
       org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:93)
       org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:314)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][snapshot][T#15]'
     10/10 snapshots sharing following 19 elements
       sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
       sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
       sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
       sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
       java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
       java.nio.file.Files.newOutputStream(Files.java:216)
       org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:132)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1335)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1274)
       org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:846)
       org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:368)
       org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:93)
       org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:314)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][write][T#4]'
     4/10 snapshots sharing following 63 elements
       sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
       sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
       sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
       sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
       java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
       java.nio.file.Files.newOutputStream(Files.java:216)
       org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:411)
       org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:407)
       org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:255)
       org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
       org.elasticsearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:130)
       org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
       org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
       org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:112)
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
       org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)
       org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:39)
       org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:46)
       org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:366)
       org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:402)
       org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
       org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
       org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
       org.apache.lucene.index.IndexWriter.softUpdateDocument(IndexWriter.java:1654)
       org.elasticsearch.index.engine.InternalEngine.updateDocs(InternalEngine.java:1204)
       org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1033)
       org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:864)
       org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789)
       org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762)
       org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:719)
       org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$executeIndexRequestOnPrimary$3(TransportShardBulkAction.java:452)
       org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$2452/1179146316.get(Unknown Source)
       org.elasticsearch.action.bulk.TransportShardBulkAction.executeOnPrimaryWhileHandlingMappingUpdates(TransportShardBulkAction.java:475)
       org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:450)
       org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:218)
       org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:161)
       org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:153)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:141)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:79)
       org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1033)
       org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1011)
       org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:105)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:413)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:359)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$2340/749012392.accept(Unknown Source)
       org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
       org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269)
       org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236)
       org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2512)
       org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:970)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:358)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:313)
       org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:305)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     6/10 snapshots sharing following 63 elements
       sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
       sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
       sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
       sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
       java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
       java.nio.file.Files.newOutputStream(Files.java:216)
       org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:411)
       org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:407)
       org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:255)
       org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
       org.elasticsearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:130)
       org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
       org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
       org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:115)
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
       org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)

       
       }]

Vishal_Goel · October 8, 2021, 3:52am

Hi @dadoonet ,
Please moderate the post. I have already flagged it with the reason.

system · November 5, 2021, 3:52am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch cluster request timeout and slow response time Elasticsearch	1	1576	March 2, 2021
Troubleshooting cluster wide performance slow downs Elasticsearch	1	430	July 6, 2017
New Elasticsearch 7.6.0 cluster eventually becomes unresponsive Elasticsearch	3	369	April 13, 2020
Elasticsearch cluster [6.8] becomes unresponsive for a small duration when one of the nodes in the cluster does not respond to any requests & is not part of the cluster. Is this the expected behaviour? Elasticsearch	5	237	November 1, 2022
Elasticsearch stop working properly Elasticsearch	2	360	September 28, 2020

ElasticSearch freezes suddenly and unable to respond

Related Topics