We have had ES OSS 7.0 running for more than a couple of months. Twice in the last 15 days, ES froze and became very slow, to the point of being almost unable to process any request.
Setup:
Single node, Windows VM, 16 cores, 32 GB RAM, ES heap (Xmx) 4 GB, total ES index data ~2 GB.
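For reference, the heap is configured in jvm.options; a minimal sketch matching the setup above (assuming Xms is set equal to Xmx, as Elasticsearch recommends):

```
-Xms4g
-Xmx4g
```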
Symptoms:
During both occurrences, we noticed the following:
ES logs start showing slowness symptoms, with lines like the following:
Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
Failed to update node information for ClusterInfoUpdateJob within 15s timeout
After some time, index requests start timing out. We use the ES high-level REST client, whose requests start timing out after 30 seconds (a sketch of the client timeout configuration follows this list).
Some time after that, ES starts rejecting requests because its write thread pool queue fills up (capacity 200); a way to check the queue is also shown below.
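For context, a minimal sketch of how the client timeout mentioned above is configured; the host, port, and exact values here are illustrative, not our production settings:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

public class EsClientFactory {
    public static RestHighLevelClient create() {
        RestClientBuilder builder = RestClient
            .builder(new HttpHost("localhost", 9200, "http")) // placeholder host/port
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)    // ms
                .setSocketTimeout(30_000));  // ms - the 30s timeout that keeps expiring
        return new RestHighLevelClient(builder);
    }
}
```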
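The rejections are visible in the write thread pool stats, which can be checked (while the node still responds) with, for example:

```
GET /_cat/thread_pool/write?v&h=name,size,active,queue,queue_size,rejected,completed
```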
We monitored the machine and ES at the time of the issue:
ES -> All shards are yellow or green; no red shards. Querying the hot_threads API times out (the exact call is sketched after this list), and sometimes even node statistics requests time out or are unresponsive. When node statistics do come back, we see nothing alarming except the rejection count.
Windows -> We don't see high pressure on CPU or memory, and disk I/O also looks OK. Nothing extraordinary here, except an antivirus scan running.
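For reference, the hot_threads call we attempt (matching the parameters visible in the output further below) is:

```
GET /_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true
```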
Logs:
[WARN ][o.e.c.InternalClusterInfoService] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[ERROR][o.e.a.b.TransportBulkAction] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@13da6e35 on EsThreadPoolExecutor[name = redacted/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bed0d28[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 1158236]]
We are unable to work out what could be causing this issue. It seems that ES gets stuck or very slow for some reason and is unable to recover even after 2-3 days.
Hi @MarkWalkom, I have updated the question with the _cluster/stats output. I don't see anything out of the ordinary there except the rejected count for indexing. We are planning to upgrade to ES 7.10, but want to make sure we understand the issue first.
I would recommend temporarily disabling the antivirus to see whether that has any impact. I have seen some types of antivirus have a huge impact on write performance in the past, and indexing is quite I/O intensive.
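If you try that, excluding the Elasticsearch data and logs directories from on-access scanning is usually the key part; those are whatever path.data and path.logs point to in elasticsearch.yml, for example:

```
# elasticsearch.yml - illustrative Windows paths, not the actual ones
path.data: D:\elasticsearch\data
path.logs: D:\elasticsearch\logs
```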
OK. We will check after upgrading to the latest ES version.
At the time of the issue, I can see from the Windows event log that the antivirus was running; it ran for 3-4 hours and then stopped.
Is it possible that it slowed ES down so badly that ES could not recover from it even after 2-3 days?
We had to restart ES to resolve the issue.
We have not yet upgraded the ES version (still 7.0, due to some other issues), and the issue popped up again.
This time we were able to capture the hot_threads output.
Can we get any idea from it of what went wrong here? On the face of it, the threads seem to be stuck on disk I/O. Note that this ES server is running on a VMware Windows VM.
Any help would be very much appreciated.
Hot threads at 2021-10-06T16:01:16.482Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
9.4% (46.8ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][snapshot][T#16]'
10/10 snapshots sharing following 19 elements
sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
java.nio.file.Files.newOutputStream(Files.java:216)
org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:132)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1335)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1274)
org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:846)
org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:368)
org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:93)
org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:314)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][snapshot][T#15]'
10/10 snapshots sharing following 19 elements
sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
java.nio.file.Files.newOutputStream(Files.java:216)
org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:132)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1335)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1274)
org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:846)
org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:368)
org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:93)
org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:314)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[hostxxx][write][T#4]'
4/10 snapshots sharing following 63 elements
sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
java.nio.file.Files.newOutputStream(Files.java:216)
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:411)
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:407)
org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:255)
org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
org.elasticsearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:130)
org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:112)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)
org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:39)
org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:46)
org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:366)
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:402)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
org.apache.lucene.index.IndexWriter.softUpdateDocument(IndexWriter.java:1654)
org.elasticsearch.index.engine.InternalEngine.updateDocs(InternalEngine.java:1204)
org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1033)
org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:864)
org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789)
org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762)
org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:719)
org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$executeIndexRequestOnPrimary$3(TransportShardBulkAction.java:452)
org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$2452/1179146316.get(Unknown Source)
org.elasticsearch.action.bulk.TransportShardBulkAction.executeOnPrimaryWhileHandlingMappingUpdates(TransportShardBulkAction.java:475)
org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:450)
org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:218)
org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:161)
org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:153)
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:141)
org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:79)
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1033)
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1011)
org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:105)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:413)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:359)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$2340/749012392.accept(Unknown Source)
org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269)
org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236)
org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2512)
org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:970)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:358)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:313)
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:305)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
6/10 snapshots sharing following 63 elements
sun.nio.fs.WindowsNativeDispatcher.CreateFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.CreateFile(WindowsNativeDispatcher.java:71)
sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:162)
sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:225)
java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
java.nio.file.Files.newOutputStream(Files.java:216)
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:411)
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:407)
org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:255)
org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
org.elasticsearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:130)
org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:74)
org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:115)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)
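One observation from the output above: two of the three busiest threads are [snapshot] threads writing files through FsBlobContainer, so a snapshot to a filesystem repository appears to have been running at the time, competing with the write threads for disk I/O. Whether a snapshot is still in flight can be checked with, for example:

```
GET /_snapshot/_status
```

If the snapshot repository lives on the same disk (or on a slow shared filesystem), that could help explain write threads blocking in CreateFile0.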