Indices keep growing after traffic towards Elasticsearch has stopped

After performing a test with an indexing rate of ~30000/sec for a couple of hours, we noticed the following behavior. After Fluentd has stopped sending data to Elasticsearch, for some time a number of documents are missing from some of the indices. Then, all of a sudden, the missing documents are there. I have also observed that during this period, while traffic has stopped and documents are missing, CPU utilization drops from ~9000m to 600m - 800m, and when the missing documents finally appear it drops to 20m. Can you please help me understand this behavior and how to resolve it, so that the moment the traffic stops all the documents are in place? Below you can see the cluster health output while the test runs.

{
"active_primary_shards": 15,
"active_shards": 30,
"active_shards_percent_as_number": 100.0,
"cluster_name": "nc0299-admin-ns-zts6-ztsl6",
"delayed_unassigned_shards": 0,
"initializing_shards": 0,
"number_of_data_nodes": 2,
"number_of_in_flight_fetch": 0,
"number_of_nodes": 8,
"number_of_pending_tasks": 0,
"relocating_shards": 0,
"status": "green",
"task_max_waiting_in_queue_millis": 0,
"timed_out": false,
"unassigned_shards": 0
}
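
(This is the output of the cluster health API; a request along these lines, against the default HTTP port, returns it:)

GET /_cluster/health?pretty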

What sort of timeframe are you talking about here?

For a half-hour run they are all there after ~3 minutes. For a 2-hour run it's ~10 minutes.

Have you altered the refresh interval of the index to optimise indexing? If so, what is it set to?

No I haven't. Can you suggest what value to set and how to do that?

If you have not overridden the default, it may be the merging of segments that takes place after the last refresh. If you want to know what it is doing, use the hot threads API during this period.
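
For reference, checking and changing the refresh interval on a live index looks roughly like this; the index name is a placeholder and 30s is only an illustrative value, not a recommendation:

GET /<index-name>/_settings?include_defaults=true&filter_path=**.refresh_interval

PUT /<index-name>/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}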

So what do I do so that I don't have this delay?

Why is this delay causing problems?

As time passes with traffic coming in, the delay becomes even higher. I have a feature where we increase indexing rates in Elasticsearch, and we try to tune whatever is needed so that when the load stops all documents are there and visualized. So in an 8-hour run I might have everything visualized only after 1 hour.

The first thing is to identify what it is doing. Use the hot threads API for that and share the output here.
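
For reference, it can be fetched for all nodes with something like the following; these parameter values are the API defaults:

GET /_nodes/hot_threads?threads=3&interval=500ms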

Do you have any non-default settings at index or cluster level?

only {"search.max_buckets": 240000}}' has a non default value in es configuration.

This is a hot threads sample for the client and master pods:

::: {zts6-ztsl6-belk-elasticsearch-master-1}{4Nn3YCBcSXGJ3Cqev6zmjA}{_q4JfPT6QxqvNAyqvqocIg}{xx.xx}{xx.xx:9300}{m}
Hot threads at 2021-08-12T12:33:56.244Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {zts6-ztsl6-belk-elasticsearch-client-555c85f959-bkfqx}{uCjIYmopSTWXVon4xzwzuQ}{PYeuegbZSJWvrOf6oPbgcQ}{xx.xx}{xx.xx:9300}{i}
Hot threads at 2021-08-12T12:33:56.244Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

and I am also getting a lot of these:

::: {zts6-ztsl6-belk-elasticsearch-data-0}{K5SxFaUNSNi57oWvnPMNRA}{wxtUb2HEQl-g32D6W7EcZw}{xx.xx}{xx.xx:9300}{d}
Hot threads at 2021-08-12T12:36:06.407Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

92.6% (463ms out of 500ms) cpu usage by thread 'elasticsearch[zts6-ztsl6-belk-elasticsearch-data-0][write][T#4]'
6/10 snapshots sharing following 43 elements
app//org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468)
app//org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:555)
app//org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:402)
app//org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:480)
app//org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
app//org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
app//org.elasticsearch.index.engine.InternalEngine.addDocs(InternalEngine.java:1171)
app//org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1108)
app//org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:948)
app//org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:815)
app//org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:787)
app//org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:751)
app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:455)
app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:437)
app//org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:408)
app//org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:81)
app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:513)

@Christian_Dahlqvist Is this what we need? The file in which I am capturing the output every minute cannot be uploaded here, so let me know how I can share it, or what else to do.

The last hot thread captured after the load stopped was this one, but I am not sure what it is doing.

::: {zts6-ztsl6-belk-elasticsearch-data-1}{KqVSJ1Z1SAGLVyDEtBiy_w}{RuwdblmCQfe1Aj7re9x7wQ}{192.168.222.65}{192.168.222.65:9300}{d}
Hot threads at 2021-08-12T13:44:38.102Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

99.6% (498.2ms out of 500ms) cpu usage by thread 'elasticsearch[zts6-ztsl6-belk-elasticsearch-data-1][[debug_logs-0-ntas07-2021-08-12][0]: Lucene Merge Thread #108]'
4/10 snapshots sharing following 14 elements
app//org.apache.lucene.index.DocIDMerger$SequentialDocIDMerger.next(DocIDMerger.java:99)
app//org.apache.lucene.index.MappingMultiPostingsEnum.nextDoc(MappingMultiPostingsEnum.java:103)
app//org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:133)
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:937)
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:347)
app//org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:197)
app//org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:245)
app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4463)
app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101)
app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
6/10 snapshots sharing following 11 elements
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:937)
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:347)
app//org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:197)
app//org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:245)
app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4463)
app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101)
app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)

Hi @Christian_Dahlqvist
Is there any update on this problem?

Looks like it is indexing and merging. Maybe the queues built up during the load cycle and it takes a while to get through them, followed by some merging? Given that the indexing queues are finite, I would not expect this to scale with increased load.
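
For reference, the write thread pool queue and rejection counts per node can be watched during and after the load with something like:

GET /_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed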

Do you use very large bulk requests or do you maybe have very large and complex documents/mappings? Do you have slow storage?

By merging, do you mean Lucene segment merging?
Just to note that the number of replicas is 1.
Is there any way to check whether merging causes the problem?
Are there any other ways to change the configuration and see whether there is any impact on our issue,
e.g. decrease the number of replicas to 0, increase refresh_interval, etc.? (A sketch of what I have in mind is just below.)
Thank you in advance!
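
The kind of change I have in mind, with a placeholder index name and purely illustrative values:

PUT /<index-name>/_settings
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "60s"
  }
}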

Any update here @Christian_Dahlqvist?

The hot threads dump indicates that merging is happening, so that seems to be what is keeping the node active. I do not see why this would be a problem though - Elasticsearch constantly merges unless you have altered the behaviour through custom settings.

What type of storage are you using? If you have very slow storage, I guess merging might get throttled and take a long time. What is the full output of the cluster stats API?
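
For reference, something like this should return it:

GET /_cluster/stats?human&pretty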