High CPU usage during search

Hello,
I have read through most of the related topics and tried the suggestions, but nothing has helped so far.

We are running an Elasticsearch cluster (v7.16.1) with 2 nodes (4 CPUs, 16 GB heap, 32 GB physical memory). We manage our indexes with ILM: we roll over daily and move indexes to the warm phase after 2 days, where a force merge reduces the segment count per shard to 2 and replicas are reduced to 0.
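For context, a simplified sketch of the ILM policy described above (the policy name matches the template further down; the rollover condition is paraphrased from the daily rollover mentioned above):

PUT _ilm/policy/lifecycle-1
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "forcemerge": { "max_num_segments": 2 },
          "allocate": { "number_of_replicas": 0 }
        }
      }
    }
  }
}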

Search latency is cripplingly slow: our searches only come back after 10 seconds, and with a timeout configured at 30 seconds, our search requests end up timing out.

We use data streams: 7 data streams with an average of 30 backing indexes each. The current write index has 1 shard and 1 replica. Average shard size is less than 50 GB and there are 1500 segments (across both nodes).
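For reference, the shard, segment and data stream figures above come from the cat and data stream APIs (index pattern and data stream name below are placeholders for ours):

GET _cat/indices/.ds-my-app-index*?v&h=index,pri,rep,docs.count,store.size
GET _cat/segments/.ds-my-app-index*?v&h=index,shard,prirep,segment,docs.count,size
GET _data_stream/my-app-index-ds

The hot threads output from one of the nodes is below.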

::: {xxxxxxx}{JS2hVWeHQOyV48TgxDlXZw}{HUXiS-ktQpyXWmTwyqeEFA}{xxxxxxx}{aa.bb.cc.dd:9300}{cdfhilmrstw}{ml.machine_memory=33557848064, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, dc_type=dc, ml.max_jvm_size=16785604608}
   Hot threads at 2022-10-17T13:56:16.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.0% [cpu=87.6%, other=12.4%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxxxx][search][T#2]'
     2/10 snapshots sharing following 54 elements
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.readValueUnsafe(AbstractXContentParser.java:394)
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.readMapEntries(AbstractXContentParser.java:318)
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.readValueUnsafe(AbstractXContentParser.java:394)
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.readMapEntries(AbstractXContentParser.java:318)
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.readMapSafe(AbstractXContentParser.java:304)
       app//org.elasticsearch.xcontent.support.AbstractXContentParser.map(AbstractXContentParser.java:254)
       app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:210)
       app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:138)
       app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:106)
       app//org.elasticsearch.search.lookup.SourceLookup.sourceAsMapAndType(SourceLookup.java:90)
       app//org.elasticsearch.search.lookup.SourceLookup.source(SourceLookup.java:79)
       app//org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:93)
       app//org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:109)
       app//org.elasticsearch.script.StringFieldScript$1$1.execute(StringFieldScript.java:35)
       app//org.elasticsearch.script.StringFieldScript.resultsForDoc(StringFieldScript.java:94)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:27)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:19)
       app//org.elasticsearch.search.runtime.AbstractScriptFieldQuery$1$1.matches(AbstractScriptFieldQuery.java:76)
       app//org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:381)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:265)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:245)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:45)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:194)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$26(IndicesService.java:1522)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7190/0x0000000801baa218.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$27(IndicesService.java:1588)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7191/0x0000000801baa978.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:178)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:161)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:419)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:124)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1594)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1516)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:456)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$6500/0x0000000801a72de8.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$6501/0x0000000801a73010.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$6502/0x0000000801a73238.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.2/java.lang.Thread.run(Thread.java:831)
     3/10 snapshots sharing following 58 elements
       java.base@16.0.2/sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:815)
       java.base@16.0.2/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:800)
       app//org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:170)
       app//org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:315)
       app//org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:133)
       app//org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
       app//org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:103)
       app//org.apache.lucene.codecs.lucene87.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress(LZ4WithPresetDictCompressionMode.java:129)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.doReset(CompressingStoredFieldsReader.java:564)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.reset(CompressingStoredFieldsReader.java:466)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.document(CompressingStoredFieldsReader.java:656)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:678)
       app//org.elasticsearch.search.internal.FieldUsageTrackingDirectoryReader$FieldUsageTrackingLeafReader$FieldUsageTrackingStoredFieldsReader.visitDocument(FieldUsageTrackingDirectoryReader.java:204)
       app//org.elasticsearch.search.lookup.SourceLookup$$Lambda$6567/0x0000000801a8f4a8.accept(Unknown Source)
       app//org.elasticsearch.search.lookup.SourceLookup.source(SourceLookup.java:73)
       app//org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:93)
       app//org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:109)
       app//org.elasticsearch.script.StringFieldScript$1$1.execute(StringFieldScript.java:35)
       app//org.elasticsearch.script.StringFieldScript.resultsForDoc(StringFieldScript.java:94)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:27)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:19)
       app//org.elasticsearch.search.runtime.AbstractScriptFieldQuery$1$1.matches(AbstractScriptFieldQuery.java:76)
       app//org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:381)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:265)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:245)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:45)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:194)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$26(IndicesService.java:1522)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7190/0x0000000801baa218.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$27(IndicesService.java:1588)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7191/0x0000000801baa978.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:178)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:161)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:419)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:124)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1594)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1516)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:456)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$6500/0x0000000801a72de8.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$6501/0x0000000801a73010.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$6502/0x0000000801a73238.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.2/java.lang.Thread.run(Thread.java:831)
     2/10 snapshots sharing following 50 elements
       java.base@16.0.2/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1046)
       java.base@16.0.2/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1045)
       java.base@16.0.2/java.util.AbstractCollection.addAll(AbstractCollection.java:335)
       app//org.elasticsearch.index.fieldvisitor.FieldsVisitor.reset(FieldsVisitor.java:177)
       app//org.elasticsearch.index.fieldvisitor.FieldsVisitor.<init>(FieldsVisitor.java:57)
       app//org.elasticsearch.index.fieldvisitor.FieldsVisitor.<init>(FieldsVisitor.java:50)
       app//org.elasticsearch.search.lookup.SourceLookup.source(SourceLookup.java:72)
       app//org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:93)
       app//org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:109)
       app//org.elasticsearch.script.StringFieldScript$1$1.execute(StringFieldScript.java:35)
       app//org.elasticsearch.script.StringFieldScript.resultsForDoc(StringFieldScript.java:94)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:27)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:19)
       app//org.elasticsearch.search.runtime.AbstractScriptFieldQuery$1$1.matches(AbstractScriptFieldQuery.java:76)
       app//org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:381)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:265)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:245)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:45)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:194)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$26(IndicesService.java:1522)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7190/0x0000000801baa218.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$27(IndicesService.java:1588)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7191/0x0000000801baa978.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:178)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:161)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:419)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:124)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1594)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1516)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:456)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$6500/0x0000000801a72de8.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$6501/0x0000000801a73010.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$6502/0x0000000801a73238.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.2/java.lang.Thread.run(Thread.java:831)
     3/10 snapshots sharing following 43 elements
       app//org.elasticsearch.script.AbstractFieldScript.extractFromSource(AbstractFieldScript.java:93)
       app//org.elasticsearch.script.AbstractFieldScript.emitFromSource(AbstractFieldScript.java:109)
       app//org.elasticsearch.script.StringFieldScript$1$1.execute(StringFieldScript.java:35)
       app//org.elasticsearch.script.StringFieldScript.resultsForDoc(StringFieldScript.java:94)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:27)
       app//org.elasticsearch.search.runtime.AbstractStringScriptFieldQuery.matches(AbstractStringScriptFieldQuery.java:19)
       app//org.elasticsearch.search.runtime.AbstractScriptFieldQuery$1$1.matches(AbstractScriptFieldQuery.java:76)
       app//org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:381)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:265)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:245)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:45)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:194)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:167)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:255)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:212)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:98)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$26(IndicesService.java:1522)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7190/0x0000000801baa218.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$27(IndicesService.java:1588)
       app//org.elasticsearch.indices.IndicesService$$Lambda$7191/0x0000000801baa978.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:178)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:161)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:419)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:124)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1594)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1516)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:456)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:622)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:483)
       app//org.elasticsearch.search.SearchService$$Lambda$6500/0x0000000801a72de8.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$6501/0x0000000801a73010.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$6502/0x0000000801a73238.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.2/java.lang.Thread.run(Thread.java:831)

I will be happy to furnish more details. I would really appreciate some guidance on this.

You should force merge down to 1 segment in order to maximize the benefits of a force merge.

How many indices and shards are you querying?

What type of storage are you using?

How much data is there on each node?

From that stack trace you seem to be using runtime fields. I guess if you want to improve search performance you need to index those fields.
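You can confirm this by checking whether any of the backing indices define a runtime section in their mappings (runtime fields can also be defined per search request or in a Kibana data view). A quick check, with a placeholder index pattern and field name:

GET .ds-my-app-index*/_mapping

# Illustrative only: a runtime field in a mapping shows up under a
# "runtime" section next to "properties", for example
# "runtime": {
#   "some_field": {
#     "type": "keyword",
#     "script": { "source": "emit(params._source.some_field.toString())" }
#   }
# }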


You should force merge down to 1 segment in order to maximize the benefits of a force merge.

We did try a manual force merge as one of our to-do items (after reading through a few docs), but unfortunately that did not help.

How many indices and shards are you querying?

We query against the data stream. The largest data stream is 1TB in size, with 50 indices and 64 shards

How much data is there on each node?

Each node holds around 800 GB of data, out of a capacity of around 1.7 TB.

What type of storage are you using?

We seem to be using HDDs for storage.

What we have noted so far is that there is a correlation between high CPU, high GC count, high GC duration and high disk I/O. All of these happen on search requests; we don't seem to have any issues during indexing.

From that stack trace you seem to be using runtime fields. I guess if you want to improve search performance you need to index those fields.

We are using a dynamic template where we specify that all fields should be indexed. Please see our index template definition below.

Note:

  1. Initially we saw a correlation between CPU usage and warmer duration, so we added the option to disable the index warmer.
  2. We read that fielddata must be set to false, so we added that as well.
{
  "priority": 100,
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "lifecycle-1"
        },
        "refresh_interval": "10s",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "30m"
          }
        },
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "warmer": {
          "enabled": "false"
        }
      }
    },
    "mappings": {
      "_routing": {
        "required": false
      },
      "numeric_detection": false,
      "dynamic_date_formats": [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "_source": {
        "excludes": [],
        "includes": [],
        "enabled": true
      },
      "dynamic": true,
      "dynamic_templates": [
        {
          "message_field": {
            "mapping": {
              "fielddata": "false",
              "norms": false,
              "index": "true",
              "type": "text"
            },
            "match_mapping_type": "string",
            "match": "message"
          }
        },
        {
          "string_fields": {
            "mapping": {
              "fielddata": "false",
              "norms": "false",
              "index": "true",
              "type": "keyword"
            },
            "match_mapping_type": "string",
            "match": "*"
          }
        }
      ],
      "date_detection": true,
      "properties": {
        "@timestamp": {
          "index": true,
          "ignore_malformed": false,
          "store": false,
          "type": "date",
          "doc_values": true
        },
        "geoip": {
          "type": "object",
          "properties": {
            "location": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  },
  "index_patterns": [
    "my-app-index*"
  ],
  "data_stream": {
    "hidden": false
  },
  "composed_of": [],
  "_meta": {
    "description": "Index template for indexes matching the pattern my-app-index*"
  }
}

What type of queries are timing out? If it is Kibana dashboards, it would be useful to check whether these (as Ignacio pointed out) make use of runtime fields, and to provide some details about the type and number of visualisations in them.

It seems like you have a fair bit of iowait, which likely affects query performance. Please see the docs for suggestions on how to address this.

It may also be that you do not have enough heap for the queries and the load the cluster is under. Forcemerging indices down to a single segment (not 2) will help reduce heap usage. This could also be due to the type of queries you are running.
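If it helps while you investigate, the relevant pressure points are visible from the stats and cat APIs, for example:

# OS-level CPU/load and (on Linux) per-device I/O statistics
GET _nodes/stats/os,fs

# JVM heap usage plus GC counts and durations
GET _nodes/stats/jvm

# Queueing and rejections on the search thread pool
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected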

What type of queries are timing out?

The issue starts when a search request is made across all fields, something like the sketch below.
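This is only an approximation of the request's shape (index name and search term are placeholders):

GET my-app-index-ds/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ],
      "must": [
        { "query_string": { "query": "SOME-SEARCH-TERM" } }
      ]
    }
  }
}

With index.query.default_field at its default of "*", a query like this fans out across every field in the index.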

A search like the one above drives CPU usage up, and after that our search requests time out. However, if we search as shown below, we don't see high CPU usage immediately; but once there are too many concurrent users (for us, around 20 people), we again start facing high CPU.
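Again only an approximation; the second style restricts the search to a single field (field value is a placeholder):

GET my-app-index-ds/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term": { "loglevel": "ERROR" } }
      ]
    }
  }
}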

It seems like you have a fair bit of iowait, which likely affects query performance. Please see the docs for suggestions on how to address this.

Thanks. We will look into each of the steps and come back with updates as soon as possible

Forcemerging indices down to a single segment (not 2) will help reduce heap usage

I believe we can only force merge segments for indices that are not the current write index of the data stream? In that case I will do the force merge and change the ILM policy as well, roughly as sketched below. But if there is anything else, please do let us know.
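A sketch of that plan, with a placeholder backing index name:

# Force merge an older, non-write backing index down to a single segment
POST .ds-my-app-index-ds-2022.10.01-000123/_forcemerge?max_num_segments=1

# and in the ILM policy, change the warm-phase action from
# "forcemerge": { "max_num_segments": 2 }
# to
# "forcemerge": { "max_num_segments": 1 }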

Do you have any runtime fields defined in Kibana for these indices?

Do you have any runtime fields defined in Kibana for these indices?

No, we don't have any runtime fields in any of these indices. An example of an index's settings is given below.

In Kibana, we use the index pattern to find the indices against which the search should be executed. Kibana loads all the fields, but we don't do anything specific in Kibana with respect to fields.

{
  "settings": {
    "index": {
      "refresh_interval": "10s",
      "hidden": "true",
      "blocks": {
        "write": "true"
      },
      "provided_name": "shrink-vk_k-.ds-filebeat-ds-2022.09.02-000315",
      "creation_date": "1663066482085",
      "unassigned": {
        "node_left": {
          "delayed_timeout": "30m"
        }
      },
      "priority": "990",
      "number_of_replicas": "0",
      "uuid": "_lUJr83TSg2x4LqGIGQhXg",
      "version": {
        "created": "7160199"
      },
      "warmer": {
        "enabled": "false"
      },
      "lifecycle": {
        "name": "filebeat-1",
        "parse_origination_date": "true",
        "indexing_complete": "true"
      },
      "routing": {
        "allocation": {
          "include": {
            "_tier_preference": "data_hot"
          },
          "initial_recovery": {
            "_id": "JS2hVWeHQOyV48TgxDlXZw"
          },
          "require": {
            "_id": null
          }
        }
      },
      "number_of_shards": "1",
      "routing_partition_size": "1",
      "resize": {
        "source": {
          "name": ".ds-filebeat-ds-2022.09.02-000315",
          "uuid": "Hi3-AntuSUWqcnJYKMSSmA"
        }
      }
    }
  },
  "defaults": {
    "index": {
      "flush_after_merge": "512mb",
      "final_pipeline": "_none",
      "max_inner_result_window": "100",
      "max_terms_count": "65536",
      "rollup": {
        "source": {
          "name": "",
          "uuid": ""
        }
      },
      "lifecycle": {
        "rollover_alias": "",
        "step": {
          "wait_time_threshold": "12h"
        },
        "origination_date": "-1"
      },
      "force_memory_term_dictionary": "false",
      "max_docvalue_fields_search": "100",
      "merge": {
        "scheduler": {
          "max_thread_count": "2",
          "auto_throttle": "true",
          "max_merge_count": "7"
        },
        "policy": {
          "floor_segment": "2mb",
          "max_merge_at_once_explicit": "30",
          "max_merge_at_once": "10",
          "max_merged_segment": "5gb",
          "expunge_deletes_allowed": "10.0",
          "segments_per_tier": "10.0",
          "deletes_pct_allowed": "33.0"
        }
      },
      "max_refresh_listeners": "1000",
      "max_regex_length": "1000",
      "load_fixed_bitset_filters_eagerly": "true",
      "number_of_routing_shards": "1",
      "write": {
        "wait_for_active_shards": "1"
      },
      "verified_before_close": "false",
      "mapping": {
        "coerce": "false",
        "nested_fields": {
          "limit": "50"
        },
        "depth": {
          "limit": "20"
        },
        "field_name_length": {
          "limit": "9223372036854775807"
        },
        "total_fields": {
          "limit": "1000"
        },
        "nested_objects": {
          "limit": "10000"
        },
        "ignore_malformed": "false",
        "dimension_fields": {
          "limit": "16"
        }
      },
      "source_only": "false",
      "soft_deletes": {
        "enabled": "true",
        "retention": {
          "operations": "0"
        },
        "retention_lease": {
          "period": "12h"
        }
      },
      "max_script_fields": "32",
      "query": {
        "default_field": [
          "*"
        ],
        "parse": {
          "allow_unmapped_fields": "true"
        }
      },
      "format": "0",
      "frozen": "false",
      "sort": {
        "missing": [],
        "mode": [],
        "field": [],
        "order": []
      },
      "codec": "default",
      "max_rescore_window": "10000",
      "max_adjacency_matrix_filters": "100",
      "analyze": {
        "max_token_count": "10000"
      },
      "gc_deletes": "60s",
      "top_metrics_max_size": "10",
      "optimize_auto_generated_id": "true",
      "max_ngram_diff": "1",
      "translog": {
        "generation_threshold_size": "64mb",
        "flush_threshold_size": "512mb",
        "sync_interval": "5s",
        "retention": {
          "size": "-1",
          "age": "-1"
        },
        "durability": "REQUEST"
      },
      "auto_expand_replicas": "false",
      "mapper": {
        "dynamic": "true"
      },
      "recovery": {
        "type": ""
      },
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "data_path": "",
      "highlight": {
        "max_analyzed_offset": "1000000"
      },
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "include": {
            "_tier": ""
          },
          "disk": {
            "watermark": {
              "ignore": "false"
            }
          },
          "exclude": {
            "_tier": ""
          },
          "require": {
            "_tier": ""
          },
          "enable": "all",
          "total_shards_per_node": "-1"
        }
      },
      "search": {
        "slowlog": {
          "level": "TRACE",
          "threshold": {
            "fetch": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            },
            "query": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          }
        },
        "idle": {
          "after": "30s"
        },
        "throttled": "false"
      },
      "fielddata": {
        "cache": "node"
      },
      "default_pipeline": "_none",
      "max_slices_per_scroll": "1024",
      "shard": {
        "check_on_startup": "false"
      },
      "xpack": {
        "watcher": {
          "template": {
            "version": ""
          }
        },
        "version": "",
        "ccr": {
          "following_index": "false"
        }
      },
      "percolator": {
        "map_unmapped_fields_as_text": "false"
      },
      "allocation": {
        "max_retries": "5",
        "existing_shards_allocator": "gateway_allocator"
      },
      "indexing": {
        "slowlog": {
          "reformat": "true",
          "threshold": {
            "index": {
              "warn": "-1",
              "trace": "-1",
              "debug": "-1",
              "info": "-1"
            }
          },
          "source": "1000",
          "level": "TRACE"
        }
      },
      "compound_format": "0.1",
      "blocks": {
        "metadata": "false",
        "read": "false",
        "read_only_allow_delete": "false",
        "read_only": "false"
      },
      "max_result_window": "10000",
      "store": {
        "stats_refresh_interval": "10s",
        "type": "",
        "fs": {
          "fs_lock": "native"
        },
        "preload": [],
        "snapshot": {
          "snapshot_name": "",
          "index_uuid": "",
          "cache": {
            "prewarm": {
              "enabled": "true"
            },
            "enabled": "true",
            "excluded_file_types": []
          },
          "repository_uuid": "",
          "uncached_chunk_size": "-1b",
          "index_name": "",
          "partial": "false",
          "blob_cache": {
            "metadata_files": {
              "max_length": "64kb"
            }
          },
          "repository_name": "",
          "snapshot_uuid": ""
        }
      },
      "queries": {
        "cache": {
          "enabled": "true"
        }
      },
      "shard_limit": {
        "group": "normal"
      },
      "max_shingle_diff": "3",
      "query_string": {
        "lenient": "false"
      }
    }
  }
}

Mapping details

{
  "mappings": {
    "_doc": {
      "dynamic": "true",
      "_data_stream_timestamp": {
        "enabled": true
      },
      "dynamic_date_formats": [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "dynamic_templates": [
        {
          "message_field": {
            "match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "index": "true",
              "norms": false,
              "type": "text"
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "index": "true",
              "norms": "false",
              "type": "keyword"
            }
          }
        }
      ],
      "date_detection": true,
      "numeric_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "UAID": {
          "type": "keyword"
        },
        "action": {
          "type": "keyword"
        },
        "activity": {
          "type": "keyword"
        },
        "app": {
          "type": "keyword"
        },
        "app_version": {
          "type": "keyword"
        },
        "applog": {
          "type": "keyword"
        },
        "bckgrnd": {
          "type": "keyword"
        },
        "busJob": {
          "type": "keyword"
        },
        "bytes_to_be_transferred": {
          "type": "keyword"
        },
        "certificate_name": {
          "type": "keyword"
        },
        "class": {
          "type": "keyword"
        },
        "cm_order_id": {
          "type": "keyword"
        },
        "cm_session_id": {
          "type": "keyword"
        },
        "cm_store_id": {
          "type": "keyword"
        },
        "contained_in": {
          "type": "keyword"
        },
        "dc_type": {
          "type": "keyword"
        },
        "dest_ip": {
          "type": "keyword"
        },
        "duration": {
          "type": "long"
        },
        "expires_in": {
          "type": "long"
        },
        "first_line": {
          "type": "keyword"
        },
        "geoip": {
          "properties": {
            "location": {
              "type": "geo_point"
            }
          }
        },
        "hobbit_source_path": {
          "type": "keyword"
        },
        "hobbit_target": {
          "type": "keyword"
        },
        "host": {
          "type": "keyword"
        },
        "instance": {
          "type": "keyword"
        },
        "ipg_org_file_name": {
          "type": "keyword"
        },
        "ipg_target_file_name": {
          "type": "keyword"
        },
        "keystore_name": {
          "type": "keyword"
        },
        "kubernetes": {
          "properties": {
            "container": {
              "properties": {
                "name": {
                  "type": "keyword"
                }
              }
            },
            "namespace": {
              "type": "keyword"
            },
            "node": {
              "properties": {
                "name": {
                  "type": "keyword"
                }
              }
            },
            "pod": {
              "type": "object"
            }
          }
        },
        "log_format": {
          "type": "keyword"
        },
        "log_source": {
          "type": "keyword"
        },
        "log_type": {
          "type": "keyword"
        },
        "logger": {
          "type": "keyword"
        },
        "loglevel": {
          "type": "keyword"
        },
        "mcsbusJob": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "norms": false
        },
        "not_after": {
          "type": "keyword"
        },
        "not_before": {
          "type": "keyword"
        },
        "order_id": {
          "type": "keyword"
        },
        "reference_id": {
          "type": "keyword"
        },
        "remote_address": {
          "type": "keyword"
        },
        "report_name": {
          "type": "keyword"
        },
        "report_status": {
          "type": "keyword"
        },
        "server_group": {
          "type": "keyword"
        },
        "server_group_env": {
          "type": "keyword"
        },
        "settlement_approach": {
          "type": "keyword"
        },
        "settlement_status": {
          "type": "keyword"
        },
        "signature": {
          "type": "keyword"
        },
        "src_ip": {
          "type": "keyword"
        },
        "status": {
          "type": "keyword"
        },
        "store_id": {
          "type": "keyword"
        },
        "track": {
          "type": "keyword"
        },
        "transfer_status": {
          "type": "keyword"
        },
        "user": {
          "type": "keyword"
        },
        "usrctx": {
          "type": "keyword"
        },
        "uuid": {
          "type": "keyword"
        }
      }
    }
  }
}
