High CPU usage caused by Lucene merge thread

I run the ELK stack, the APM server, and Metricbeat (Docker and System modules) with Docker.

Logstash is used for reading log files, and both the APM server and Metricbeat are used for monitoring with a period of 10 seconds.

Sometimes Elasticsearch shows high CPU usage, and with the Nodes hot threads API I found that it is caused by the Lucene merge thread.

The CPU usage sometimes even exceeds 200 % (monitored with docker stats), and while this is happening I get HTTP errors when trying to access a Kibana dashboard or the Elasticsearch API.

I found some suggested solutions for this problem saying that setting index.merge.scheduler.max_thread_count to 1 might help, but adding this setting to config/elasticsearch.yml (mounted as a Docker volume) causes an error: java.lang.IllegalArgumentException: node settings must not contain any index level settings.
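
As far as I understand, this is an index-level setting, so it presumably has to be applied through the index settings API or an index template rather than elasticsearch.yml. A rough, untested sketch of how I think it would be applied (the metricbeat-* pattern is just an example matching my indices, the template name merge-throttle is arbitrary, and I have not confirmed whether the setting can be updated on existing indices):

PUT /metricbeat-*/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}

PUT /_template/merge-throttle
{
  "index_patterns": ["metricbeat-*"],
  "settings": {
    "index.merge.scheduler.max_thread_count": 1
  }
}

The second request is a legacy index template, so the setting would also apply to indices created in the future.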

Besides, I found that the high CPU usage occurs at roughly the same interval as the monitoring by the APM server and Metricbeat. Could the root cause be that too many documents are being written to Elasticsearch?
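
To check this, I suppose I could watch the indexing and merge counters on the Metricbeat indices with the index stats API and see whether they jump every monitoring period. Something like this (untested; the filter_path is only there to keep the output short):

GET /metricbeat-*/_stats/indexing,merge?filter_path=_all.total.indexing.index_total,_all.total.merges

Running it a few times, a minute or so apart, should show how many documents were indexed and how much time was spent merging between samples.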

Is there a proper solution to this problem, or which settings should I try to tune for my environment?


Hot thread output:

::: {elasticsearch}{WPi6_SLRTiiDXAt41tpmCQ}{AAM0DuSZSUCT_KYSEvNYrQ}{172.18.0.7}{172.18.0.7:9300}{dimr}
   Hot threads at 2022-09-15T07:55:38.921Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   63.2% (316.2ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch][[metricbeat-7.12.0-2022.09.15][0]: Lucene Merge Thread #147]'
     4/10 snapshots sharing following 12 elements
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.addSortedNumericField(Lucene80DocValuesConsumer.java:705)
       app//org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:375)
       app//org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:147)
       app//org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:155)
       app//org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
     3/10 snapshots sharing following 10 elements
       app//org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:506)
       app//org.apache.lucene.codecs.lucene86.Lucene86PointsWriter.merge(Lucene86PointsWriter.java:237)
       app//org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:201)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:161)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
     3/10 snapshots sharing following 9 elements
       app//org.apache.lucene.codecs.lucene86.Lucene86PointsWriter.merge(Lucene86PointsWriter.java:237)
       app//org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:201)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:161)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
       app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
       app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)

Docker compose file:

version: "3.7"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
    container_name: elasticsearch
    hostname: elasticsearch
    volumes:
      - esdata:/usr/share/elasticsearch/data
      - ./elk/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      - discovery.type=single-node

  kibana:
    image: docker.elastic.co/kibana/kibana-oss:7.10.2
    container_name: kibana
    hostname: kibana
    ports:
      - "5601:5601"
    environment:
      SERVERNAME: kibana
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200

  apm-server:
    image: docker.elastic.co/apm/apm-server-oss:7.12.0
    volumes:
      - ./elk/apm-server.docker.yml:/usr/share/apm-server/apm-server.yml:ro
    container_name: apm-server
    hostname: apm-server
    ports:
      - "8200:8200"
    environment:
      - output.elasticsearch.hosts=http://elasticsearch:9200

  logstash:
    image: docker.elastic.co/logstash/logstash-oss:7.12.0
    container_name: logstash
    links:
      - elasticsearch
    volumes:
      - ./elk/logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - /data/log:/usr/share/logstash/proj_log:ro
    ports:
      - "5044:5044"

  metricbeat:
    image: docker.elastic.co/beats/metricbeat-oss:7.12.0
    container_name: metricbeat
    user: root
    volumes:
      - ./elk/metricbeat.yml:/usr/share/metricbeat/metricbeat.yml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro
      - /proc:/hostfs/proc:ro
      - /:/hostfs:ro
    environment:
      - setup.kibana.host=kibana:5601
      - output.elasticsearch.hosts=["http://elasticsearch:9200"]
    command: ["-system.hostfs=/hostfs"]

volumes:
  esdata:

metricbeat.yml

metricbeat.modules:
- module: docker
  metricsets:
    - "cpu"
    - "diskio"
    - "memory"
    - "network"
  hosts: ["unix:///var/run/docker.sock"]
  period: 10s
  enabled: true

- module: system
  metricsets:
    - cpu
    - load
    - memory
    - network
    - diskio
    - fsstat
  enabled: true
  period: 10s
  processes: ['.*']

  cpu.metrics:  ["percentages","normalized_percentages"]
  core.metrics: ["percentages"]

  process.include_top_n:
    by_cpu: 5
    by_memory: 5

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]

This could be a sign that your Elasticsearch storage is slow. What type of storage are you using?

@Christian_Dahlqvist If you are asking whether I am using HDD or SSD, the storage is HDD. It is deployed on an Ubuntu 20.04 VM running on VMware ESXi.

That may be part of the problem. I recommend you look at iostat and check await and disk utilisation.

@Christian_Dahlqvist Thanks for the reply.

I ran iostat -d -x 1.

For device sda, r_await is 0.0 and w_await is very small. I am not sure which field represents disk utilization; is it %util? If so, its maximum value is only 1.2.

I also found that %wrqm frequently has a value greater than 50; could that be a problem?


iostat -d -x 1 output (one sample of many):

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.00     0.00   50.00    268.00     0.00   0.00    0.00     5.36    0.00      0.00     0.00   0.00    0.00     0.00    0.00   1.20
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
sda              0.00      0.00     0.00   0.00    0.00     0.00   11.00    268.00    39.00  78.00    0.18    24.36    0.00      0.00     0.00   0.00    0.00     0.00    0.00   1.20
scd0             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00

iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.01    7.61    0.05    0.00   79.82

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             14.04         3.76       105.57       389.58   10469413  293820780 1084264988
loop0             0.00         0.00         0.00         0.00       2601          0          0
loop1             0.01         0.01         0.00         0.00      21868          0          0
loop2             0.00         0.00         0.00         0.00       3576          0          0
loop3             0.00         0.00         0.00         0.00       4621          0          0
loop4             0.00         0.00         0.00         0.00       3114          0          0
loop5             0.00         0.00         0.00         0.00       3111          0          0
loop6             0.00         0.00         0.00         0.00        814          0          0
loop7             0.01         0.01         0.00         0.00      33305          0          0
loop8             0.00         0.00         0.00         0.00        373          0          0
loop9             0.00         0.00         0.00         0.00          8          0          0
sda               6.53         3.82       105.64       390.00   10633483  294006180 1085431816
scd0              0.00         0.01         0.00         0.00      27284          0          0

I increased the period of Metricbeat to 1 minute, but whenever the monitoring runs, the CPU usage still rises to over 100 %.

I guess the documents generated each time are numerous enough to trigger a merge, so maybe what I have to do is optimize the Metricbeat configuration (metricbeat.yml), for example by disabling some metricsets, as sketched below.
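
For example, a trimmed-down version of the system module section of my metricbeat.yml might look something like the following sketch (keeping only the metricsets I actually need and a longer period; which metricsets are safe to drop is still an assumption on my part):

- module: system
  metricsets:
    - cpu
    - memory
  enabled: true
  period: 60s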

Besides, is it possible to make the merging use less CPU?
