High CPU usage caused by Lucene merge thread

I run the ELK stack, the APM server, and Metricbeat (Docker and System modules) with Docker.

Logstash is used for reading log files, and both the APM server and Metricbeat are used for monitoring with a period of 10 seconds.

Sometimes Elasticsearch shows high CPU usage, and with the Nodes hot threads API I found that it is caused by the Lucene merge thread.

The CPU usage sometimes even exceeds 200 % (monitored with docker stats), and while this is happening I get HTTP errors when trying to access a Kibana dashboard or the Elasticsearch API.

I found some suggested solutions for this problem saying that setting index.merge.scheduler.max_thread_count to 1 might help, but adding this setting to config/elasticsearch.yml (mounted as a Docker volume) causes an error: java.lang.IllegalArgumentException: node settings must not contain any index level settings.
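
As far as I understand, this is an index-level setting, so it presumably has to be applied through the index settings API or an index template rather than elasticsearch.yml. A rough, untested sketch of how I think it would be applied (the metricbeat-* pattern is just an example matching my indices, the template name merge-throttle is arbitrary, and I have not confirmed whether the setting can be updated on existing indices):

PUT /metricbeat-*/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}

PUT /_template/merge-throttle
{
  "index_patterns": ["metricbeat-*"],
  "settings": {
    "index.merge.scheduler.max_thread_count": 1
  }
}

The second request is a legacy index template, so the setting would also apply to indices created in the future.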

Besides, I found that the high CPU usage occurs at roughly the same interval as the monitoring by the APM server and Metricbeat. Could the root cause be that too many documents are being written to Elasticsearch?
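
To check this, I suppose I could watch the indexing and merge counters on the Metricbeat indices with the index stats API and see whether they jump every monitoring period. Something like this (untested; the filter_path is only there to keep the output short):

GET /metricbeat-*/_stats/indexing,merge?filter_path=_all.total.indexing.index_total,_all.total.merges

Running it a few times, a minute or so apart, should show how many documents were indexed and how much time was spent merging between samples.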

Is there a proper solution to this problem, or which settings should I try to tune for my environment?


Hot thread output:

::: {elasticsearch}{WPi6_SLRTiiDXAt41tpmCQ}{AAM0DuSZSUCT_KYSEvNYrQ}{172.18.0.7}{172.18.0.7:9300}{dimr}
   Hot threads at 2022-09-15T07:55:38.921Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   63.2% (316.2ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch][[metricbeat-7.12.0-2022.09.15][0]: Lucene Merge Thread #147]'
     4/10 snapshots sharing following 12 elements
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.addSortedNumericField(Lucene80DocValuesConsumer.java:705)
       app//org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:375)
       app//org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:147)
       app//org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:155)
       app//org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
     3/10 snapshots sharing following 10 elements
       app//org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:506)
       app//org.apache.lucene.codecs.lucene86.Lucene86PointsWriter.merge(Lucene86PointsWriter.java:237)
       app//org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:201)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:161)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
     3/10 snapshots sharing following 9 elements
       app//org.apache.lucene.codecs.lucene86.Lucene86PointsWriter.merge(Lucene86PointsWriter.java:237)
       app//org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:201)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:161)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)

Docker compose file:

version: "3.7"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
    container_name: elasticsearch
    hostname: elasticsearch
    volumes:
      - esdata:/usr/share/elasticsearch/data
      - ./elk/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      - discovery.type=single-node

  kibana:
    image: docker.elastic.co/kibana/kibana-oss:7.10.2
    container_name: kibana
    hostname: kibana
    ports:
      - "5601:5601"
    environment:
      SERVERNAME: kibana
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200

  apm-server:
    image: docker.elastic.co/apm/apm-server-oss:7.12.0
    volumes:
      - ./elk/apm-server.docker.yml:/usr/share/apm-server/apm-server.yml:ro
    container_name: apm-server
    hostname: apm-server
    ports:
      - "8200:8200"
    environment:
      - output.elasticsearch.hosts=http://elasticsearch:9200

  logstash:
    image: docker.elastic.co/logstash/logstash-oss:7.12.0
    container_name: logstash
    links:
      - elasticsearch
    volumes:
      - ./elk/logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - /data/log:/usr/share/logstash/proj_log:ro
    ports:
      - "5044:5044"

  metricbeat:
    image: docker.elastic.co/beats/metricbeat-oss:7.12.0
    container_name: metricbeat
    user: root
    volumes:
      - ./elk/metricbeat.yml:/usr/share/metricbeat/metricbeat.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro
      - /proc:/hostfs/proc:ro
      - /:/hostfs:ro
    environment:
      - setup.kibana.host=kibana:5601
      - output.elasticsearch.hosts=["http://elasticsearch:9200"]
    command: ["-system.hostfs=/hostfs"]

volumes:
  esdata:

metricbeat.yml

metricbeat.modules:
- module: docker
  metricsets:
    - "cpu"
    - "diskio"
    - "memory"
    - "network"
  hosts: ["unix:///var/run/docker.sock"]
  period: 10s
  enabled: true

- module: system
  metricsets:
    - cpu
    - load
    - memory
    - network
    - diskio
    - fsstat
  enabled: true
  period: 10s
  processes: ['.*']

  cpu.metrics:  ["percentages","normalized_percentages"]
  core.metrics: ["percentages"]

  process.include_top_n:
    by_cpu: 5
    by_memory: 5

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]

This could be a sign that your Elasticsearch storage is slow. What type of storage are you using?

@Christian_Dahlqvist If you are asking whether I am using HDD or SSD, the storage is HDD. It is deployed on an Ubuntu 20.04 VM running on VMware ESXi.

That may be part of the problem. I recommend you look at iostat and check await and disk utilisation.

@Christian_Dahlqvist Thanks for the reply.

I ran iostat -d -x 1.

For device sda, r_await is 0.0 and w_await is very small. I am not sure which field represents disk utilization; is it %util? If so, its maximum value is only 1.2.

I also found that %wrqm frequently has a value greater than 50; could that be a problem?


iostat -d -x 1 output (one sample of many):

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.00     0.00   50.00    268.00     0.00   0.00    0.00     5.36    0.00      0.00     0.00   0.00    0.00     0.00    0.00   1.20
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
sda              0.00      0.00     0.00   0.00    0.00     0.00   11.00    268.00    39.00  78.00    0.18    24.36    0.00      0.00     0.00   0.00    0.00     0.00    0.00   1.20
scd0             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00

iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.01    7.61    0.05    0.00   79.82

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             14.04         3.76       105.57       389.58   10469413  293820780 1084264988
loop0             0.00         0.00         0.00         0.00       2601          0          0
loop1             0.01         0.01         0.00         0.00      21868          0          0
loop2             0.00         0.00         0.00         0.00       3576          0          0
loop3             0.00         0.00         0.00         0.00       4621          0          0
loop4             0.00         0.00         0.00         0.00       3114          0          0
loop5             0.00         0.00         0.00         0.00       3111          0          0
loop6             0.00         0.00         0.00         0.00        814          0          0
loop7             0.01         0.01         0.00         0.00      33305          0          0
loop8             0.00         0.00         0.00         0.00        373          0          0
loop9             0.00         0.00         0.00         0.00          8          0          0
sda               6.53         3.82       105.64       390.00   10633483  294006180 1085431816
scd0              0.00         0.01         0.00         0.00      27284          0          0

I increased the period of Metricbeat to 1 minute, but whenever the monitoring runs, the CPU usage still rises to over 100 %.

I guess the documents generated each time are numerous enough to trigger a merge, so maybe what I have to do is optimize the Metricbeat configuration (metricbeat.yml), for example by disabling some metricsets, as sketched below.
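
For example, a trimmed-down version of the system module section of my metricbeat.yml might look something like the following sketch (keeping only the metricsets I actually need and a longer period; which metricsets are safe to drop is still an assumption on my part):

- module: system
  metricsets:
    - cpu
    - memory
  enabled: true
  period: 60s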

Besides, is it possible to make the merging use less CPU?
