High CPU usage because lucene merge thread

My operation process is as follows:

At 3 am everyday, program try to index data to ES, that usually takes 3 hours. During the indexing process, I will set the number_of_replicas parameter to 0, and when data indexing is completed, I will change the number_of_replicas to 1, and then wait for cluster health turn to green. at last, I change the index alias from the previous index (created on previous day) to point to the newly created index.

During the whole indexing process, the CPU usage of ES cluster was always normal, but when the alias switch complete, the CPU usage suddenly increased. by using the hot_thread API, I saw a lot of Lucene Merge threads.

The ES version I used was 7.8.1, Cluster have six nodes and each nodes have 16G heap memory. the disk usage is well and disk type is SSD. the index size is 55G.

I tried to change the index.merge.schedular.max_thread_count parameter to 1, but it doesn't seem to improve.

Any suggestions are needed.

Best regards.

Is it causing issues?

Yes, some requests to query ES will return 502 errors

What is the output from the _cluster/stats?pretty&human API?
Can you post the hot threads and Elasticsearch logs from the time this happens?

the output is:

{
  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "cluster",
  "timestamp" : 1623897005490,
  "status" : "green",
  "indices" : {
    "count" : 58,
    "shards" : {
      "total" : 339,
      "primaries" : 181,
      "replication" : 0.8729281767955801,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 32,
          "avg" : 5.844827586206897
        },
        "primaries" : {
          "min" : 1,
          "max" : 16,
          "avg" : 3.1206896551724137
        },
        "replication" : {
          "min" : 0.0,
          "max" : 2.0,
          "avg" : 0.9655172413793104
        }
      }
    },
    "docs" : {
      "count" : 320479456,
      "deleted" : 49848119
    },
    "store" : {
      "size" : "1.4tb",
      "size_in_bytes" : 1561367960432
    },
    "fielddata" : {
      "memory_size" : "1.9gb",
      "memory_size_in_bytes" : 2140146480,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "4.9gb",
      "memory_size_in_bytes" : 5315294942,
      "total_count" : 12515270180,
      "hit_count" : 2377814364,
      "miss_count" : 10137455816,
      "cache_size" : 415249,
      "cache_count" : 60055189,
      "evictions" : 59639940
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4101,
      "memory" : "348.6mb",
      "memory_in_bytes" : 365606982,
      "terms_memory" : "242.3mb",
      "terms_memory_in_bytes" : 254158536,
      "stored_fields_memory" : "7.9mb",
      "stored_fields_memory_in_bytes" : 8376536,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "50.9mb",
      "norms_memory_in_bytes" : 53434880,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "47.3mb",
      "doc_values_memory_in_bytes" : 49637030,
      "index_writer_memory" : "1.7gb",
      "index_writer_memory_in_bytes" : 1871176000,
      "version_map_memory" : "68.2mb",
      "version_map_memory_in_bytes" : 71529626,
      "fixed_bit_set" : "1.3mb",
      "fixed_bit_set_memory_in_bytes" : 1460736,
      "max_unsafe_auto_id_timestamp" : 1623888002457,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "boolean",
          "count" : 443,
          "index_count" : 42
        },
        {
          "name" : "date",
          "count" : 370,
          "index_count" : 50
        },
        {
          "name" : "double",
          "count" : 48,
          "index_count" : 15
        },
        {
          "name" : "flattened",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 420,
          "index_count" : 23
        },
        {
          "name" : "geo_point",
          "count" : 15,
          "index_count" : 15
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "half_float",
          "count" : 56,
          "index_count" : 14
        },
        {
          "name" : "integer",
          "count" : 1915,
          "index_count" : 31
        },
        {
          "name" : "keyword",
          "count" : 2661,
          "index_count" : 57
        },
        {
          "name" : "long",
          "count" : 1499,
          "index_count" : 48
        },
        {
          "name" : "nested",
          "count" : 68,
          "index_count" : 20
        },
        {
          "name" : "object",
          "count" : 1520,
          "index_count" : 37
        },
        {
          "name" : "search_as_you_type",
          "count" : 5,
          "index_count" : 5
        },
        {
          "name" : "short",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "text",
          "count" : 7627,
          "index_count" : 42
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [
        {
          "name" : "pattern_replace",
          "count" : 34,
          "index_count" : 19
        }
      ],
      "tokenizer_types" : [
        {
          "name" : "pattern",
          "count" : 45,
          "index_count" : 15
        }
      ],
      "filter_types" : [
        {
          "name" : "edgeNGram",
          "count" : 15,
          "index_count" : 15
        },
        {
          "name" : "synonym",
          "count" : 26,
          "index_count" : 20
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 170,
          "index_count" : 20
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 60,
          "index_count" : 15
        },
        {
          "name" : "standard",
          "count" : 50,
          "index_count" : 20
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 155,
          "index_count" : 20
        },
        {
          "name" : "stop",
          "count" : 15,
          "index_count" : 15
        },
        {
          "name" : "trim",
          "count" : 30,
          "index_count" : 15
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 6,
      "ingest" : 6,
      "master" : 6,
      "ml" : 6,
      "remote_cluster_client" : 6,
      "transform" : 6,
      "voting_only" : 0
    },
    "versions" : [
      "7.8.1"
    ],
    "os" : {
      "available_processors" : 40,
      "allocated_processors" : 40,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 6
        }
      ],
      "mem" : {
        "total" : "0b",
        "total_in_bytes" : 0,
        "free" : "1.2gb",
        "free_in_bytes" : 1312112640,
        "used" : "0b",
        "used_in_bytes" : 0,
        "free_percent" : 0,
        "used_percent" : 0
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 416
      },
      "open_file_descriptors" : {
        "min" : 772,
        "max" : 1630,
        "avg" : 996
      }
    },
    "jvm" : {
      "max_uptime" : "32.9d",
      "max_uptime_in_millis" : 2848484285,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 6
        }
      ],
      "mem" : {
        "heap_used" : "43.1gb",
        "heap_used_in_bytes" : 46335750144,
        "heap_max" : "80gb",
        "heap_max_in_bytes" : 85899345920
      },
      "threads" : 1086
    },
    "fs" : {
      "total" : "2.6tb",
      "total_in_bytes" : 2958485848064,
      "free" : "1.2tb",
      "free_in_bytes" : 1364500041728,
      "available" : "1.1tb",
      "available_in_bytes" : 1214075523072
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 6
      },
      "http_types" : {
        "netty4" : 6
      }
    },
    "discovery_types" : {
      "zen" : 6
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 6
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

the hot threads:

::: {es02}{dilmrt}{ml.machine_memory=9223372036854771712, rack=r1, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
   Hot threads at 2021-06-09T09:13:52.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   84.1% (420.2ms out of 500ms) cpu usage by thread 'elasticsearch[es02][[fysearch20210609070500][0]: Lucene Merge Thread #9]'
     3/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:159)
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
     3/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:151)
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
     4/10 snapshots sharing following 7 elements
       app//org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:229)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:106)
   
   56.6% (282.8ms out of 500ms) cpu usage by thread 'elasticsearch[es02][[fysearch20210609070500][4]: Lucene Merge Thread #0]'
     4/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:151)
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
     2/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:159)
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
     2/10 snapshots sharing following 11 elements
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.reset(CompressingStoredFieldsReader.java:393)
     2/10 snapshots sharing following 7 elements
       app//org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:229)
       app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:106)
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4463)

   53.0% (264.8ms out of 500ms) cpu usage by thread 'elasticsearch[es02][[fysearch20210609070500][2]: Lucene Merge Thread #8]'
     6/10 snapshots sharing following 14 elements
       app//org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101)
       app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
     2/10 snapshots sharing following 13 elements
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
     2/10 snapshots sharing following 13 elements
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:177)

::: {es06}{dilmrt}{ml.machine_memory=9223372036854771712, rack=r1, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
   Hot threads at 2021-06-09T09:13:52.882Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   38.0% (190ms out of 500ms) cpu usage by thread 'elasticsearch[es06][[fysearch20210609070500][3]: Lucene Merge Thread #9]'
     5/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:151)
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
     3/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:159)
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:441)
     2/10 snapshots sharing following 11 elements
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.reset(CompressingStoredFieldsReader.java:393)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.document(CompressingStoredFieldsReader.java:572)
   
   34.3% (171.3ms out of 500ms) cpu usage by thread 'elasticsearch[es06][[suggestion@fysearch20210609070500][3]: Lucene Merge Thread #14]'
     2/10 snapshots sharing following 18 elements
       java.base@14.0.1/sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:817)
       java.base@14.0.1/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:802)
       app//org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
     5/10 snapshots sharing following 14 elements
       app//org.apache.lucene.util.compress.LZ4.encodeLiterals(LZ4.java:151)
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:172)
     2/10 snapshots sharing following 12 elements
       app//org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:139)
     unique snapshot
       app//org.apache.lucene.util.compress.LZ4.encodeSequence(LZ4.java:177)
   
   34.2% (171.1ms out of 500ms) cpu usage by thread 'elasticsearch[es06][write][T#4]'
     2/10 snapshots sharing following 49 elements
       app//org.apache.lucene.util.fst.ByteSequenceOutputs.read(ByteSequenceOutputs.java:35)
       app//org.apache.lucene.util.fst.FST.readArc(FST.java:1260)
       app//org.apache.lucene.util.fst.FST.readArcByDirectAddressing(FST.java:1211)
     2/10 snapshots sharing following 35 elements
       app//org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:948)
       app//org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:180)
     2/10 snapshots sharing following 36 elements
       app//org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:171)
       app//org.elasticsearch.index.translog.Translog.add(Translog.java:544)
     2/10 snapshots sharing following 42 elements
       app//org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
       app//org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
     2/10 snapshots sharing following 29 elements
       app//org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:157)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:202)

::: {es01}{dilmrt}{ml.machine_memory=9223372036854771712, rack=r1, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
   Hot threads at 2021-06-09T09:13:52.854Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   62.5% (312.6ms out of 500ms) cpu usage by thread 'elasticsearch[es01][write][T#7]'
     2/10 snapshots sharing following 45 elements
       app//org.apache.lucene.index.FieldUpdatesBuffer$BufferedUpdateIterator.nextTerm(FieldUpdatesBuffer.java:350)
       app//org.apache.lucene.index.FieldUpdatesBuffer$BufferedUpdateIterator.next(FieldUpdatesBuffer.java:317)
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:496)
     2/10 snapshots sharing following 45 elements
       app//org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(SegmentTermsEnum.java:742)
       app//org.apache.lucene.index.FrozenBufferedUpdates$TermDocsIterator.nextTerm(FrozenBufferedUpdates.java:829)
     2/10 snapshots sharing following 48 elements
       app//org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:103)
       app//org.apache.lucene.codecs.blocktree.CompressionAlgorithm$3.read(CompressionAlgorithm.java:51)
     2/10 snapshots sharing following 44 elements
       app//org.apache.lucene.index.FrozenBufferedUpdates$TermDocsIterator.nextTerm(FrozenBufferedUpdates.java:829)
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:507)
     2/10 snapshots sharing following 44 elements
       app//org.apache.lucene.index.FieldUpdatesBuffer$BufferedUpdateIterator.next(FieldUpdatesBuffer.java:317)
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:496)
   
   56.2% (281ms out of 500ms) cpu usage by thread 'elasticsearch[es01][[suggestion@fysearch20210609070500][3]: Lucene Merge Thread #55]'
     3/10 snapshots sharing following 12 elements
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:977)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:942)
     3/10 snapshots sharing following 13 elements
       app//org.apache.lucene.index.MappedMultiFields$MappedMultiTermsEnum.postings(MappedMultiFields.java:127)
       app//org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:127)
     4/10 snapshots sharing following 9 elements
       app//org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
       app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:197)
   
   55.1% (275.4ms out of 500ms) cpu usage by thread 'elasticsearch[es01][[suggestion@fysearch20210609070500][3]: Lucene Merge Thread #48]'
     3/10 snapshots sharing following 14 elements
       app//org.apache.lucene.store.DataOutput.writeVInt(DataOutput.java:191)
       app//org.apache.lucene.codecs.lucene84.Lucene84PostingsWriter.finishTerm(Lucene84PostingsWriter.java:416)
       app//org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:174)
     2/10 snapshots sharing following 14 elements
       app//org.apache.lucene.codecs.lucene84.ForDeltaUtil.encodeDeltas(ForDeltaUtil.java:67)
       app//org.apache.lucene.codecs.lucene84.Lucene84PostingsWriter.startDoc(Lucene84PostingsWriter.java:243)
     4/10 snapshots sharing following 10 elements
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:347)
     unique snapshot
       app//org.apache.lucene.index.MultiTermsEnum.pushTop(MultiTermsEnum.java:275)

::: {es03}{dilmrt}{ml.machine_memory=9223372036854771712, rack=r1, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
   Hot threads at 2021-06-09T09:13:52.854Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   33.9% (169.3ms out of 500ms) cpu usage by thread 'elasticsearch[es03][search][T#8]'
     4/10 snapshots sharing following 6 elements
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695)
     6/10 snapshots sharing following 10 elements
       java.base@14.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@14.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
   
   33.4% (166.7ms out of 500ms) cpu usage by thread 'elasticsearch[es03][write][T#2]'
     6/10 snapshots sharing following 44 elements
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:937)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:347)
     2/10 snapshots sharing following 48 elements
       app//org.apache.lucene.util.compress.LZ4.compress(LZ4.java:432)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:853)
     2/10 snapshots sharing following 45 elements
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:977)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:942)
   
   33.2% (165.8ms out of 500ms) cpu usage by thread 'elasticsearch[es03][[fysearch20210609070500][0]: Lucene Merge Thread #1]'
     3/10 snapshots sharing following 17 elements
       app//org.apache.lucene.codecs.lucene80.IndexedDISI.advance(IndexedDISI.java:384)
       app//org.apache.lucene.codecs.lucene80.IndexedDISI.nextDoc(IndexedDISI.java:459)
     2/10 snapshots sharing following 12 elements
       app//org.apache.lucene.codecs.NormsConsumer$1$1.nextDoc(NormsConsumer.java:146)
       app//org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.writeValues(Lucene80NormsConsumer.java:145)
       app//org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.addNormsField(Lucene80NormsConsumer.java:126)
     3/10 snapshots sharing following 10 elements
       app//org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.addNormsField(Lucene80NormsConsumer.java:111)
       app//org.apache.lucene.codecs.NormsConsumer.mergeNormsField(NormsConsumer.java:108)
     2/10 snapshots sharing following 15 elements
       app//org.apache.lucene.util.packed.MonotonicLongValues.get(MonotonicLongValues.java:40)
       app//org.apache.lucene.util.packed.PackedLongValues.get(PackedLongValues.java:108)

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.