High Elasticsearch CPU usage causing slow searches and timeouts in Kibana

Hello Guys,

Hope you are all doing well!

We are using a Fluent-bit + Elasticsearch + Kibana stack for logging our Kubernetes containers. However, the CPU usage sometimes spikes very high and searches from Kibana stop working.

Setup information:

> Kibana & Elastic version - 7.9.2
> Elasticsearch hosts - 5 master and 5 data nodes, running in a separate namespace on the same Kubernetes cluster.
> Fluent-bit (1.7) - collects the logs.
> Storage - standard-disk physical volumes of 1.5 TB attached to each node, 7.5 TB in total.
> Number of indices - 20. (Fluent-bit gathers around 300 to 450 GB of Kubernetes logs daily from around 20 nodes. Logs are stored in a single date-wise index per day, and only the last 20 days of indices are kept.)
> Shards - 2 per index (20 primaries & 20 replicas across the 20 indices, plus a few system-generated ones).
> Total number of docs - 7,196,678,040 (around 359,833,902 per index).
> Elasticsearch usage: screenshot attached.
> Kibana memory usage - 470 MB / 1 GB
> Single index pattern with 60 fields.
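The hot-threads dump below is output from the nodes hot threads API; it can be reproduced with a request like the following (the parameters match the `interval=500ms` and `busiestThreads=3` values shown in the dump):

```
GET /_nodes/hot_threads?threads=3&interval=500ms
```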
::: {logging-es-masters-2}{zM6G5JfuR8eN5lXwC47oWA}{Xf0Vkm3ITqqvEnxfZFpvPQ}{10.32.30.13}{10.32.30.13:9300}{mr}{xpack.installed=true, transform.node=false}
   Hot threads at 2021-09-27T10:07:57.669Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.7% (3.2ms out of 500ms) cpu usage by thread 'elasticsearch[logging-es-masters-2][transport_worker][T#1]'
     2/10 snapshots sharing following 3 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)

::: {logging-es-data-0}{woah-pXsRNaSszo62ZxzCw}{RI4xPzumRqapnXK08yyOvA}{10.32.25.18}{10.32.25.18:9300}{dirt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-09-27T10:07:57.719Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   93.6% (467.9ms out of 500ms) cpu usage by thread 'elasticsearch[logging-es-data-0][generic][T#2]'
     2/10 snapshots sharing following 42 elements
       app//org.apache.lucene.search.ConjunctionDISI.doNext(ConjunctionDISI.java:200)
       app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:240)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:265)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender.lambda$executeChunkRequest$1(RecoverySourceHandler.java:790)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender$$Lambda$5788/0x000000080193d620.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113)
       app//org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163)
       app//org.elasticsearch.action.ActionListener$6.onResponse(ActionListener.java:282)
       app//org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:136)
       app//org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1162)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1162)
       app//org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:213)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15/java.lang.Thread.run(Thread.java:832)
     5/10 snapshots sharing following 43 elements
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
00.02.05.19            50          10  98   21.23   17.47    13.52 mr        -      logging-es-masters-0
00.32.25.18            62          78  98   21.23   17.47    13.52 dirt      -      logging-es-data-0
00.02.05.19            23          10  41    3.04    2.86     3.15 mr        -      logging-es-masters-3
00.02.05.19            48          53  41    3.04    2.86     3.15 dirt      -      logging-es-data-2
00.02.05.19            26          87  34    1.04    1.51     1.76 dirt      -      logging-es-data-3
00.02.05.19            56          10  32    1.04    1.51     1.76 mr        -      logging-es-masters-2
00.02.05.19            44          11  12    0.57    0.57     0.81 mr        -      logging-es-masters-1
00.02.05.19            38          54  12    0.57    0.57     0.81 dirt      -      logging-es-data-1
00.02.05.19           60          10 100   25.23   19.94    13.10 mr        *      logging-es-masters-4
00.02.05.19           32          62 100   25.23   19.94    13.10 dirt      -      logging-es-data-4
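The node table above appears to come from the cat nodes API, e.g. a request along these lines (column selection via the standard `h` parameter):

```
GET /_cat/nodes?v&h=ip,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node.role,master,name
```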

cluster stats:

{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "logging",
  "cluster_uuid" : "wphMAxMBQ2290daQaxwWgA",
  "timestamp" : 1632737455914,
  "status" : "yellow",
  "indices" : {
    "count" : 36,
    "shards" : {
      "total" : 71,
      "primaries" : 36,
      "replication" : 0.9722222222222222,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 2,
          "avg" : 1.9722222222222223
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9722222222222222
        }
      }
    },
    "docs" : {
      "count" : 6745298896,
      "deleted" : 490704
    },
    "store" : {
      "size" : "5.2tb",
      "size_in_bytes" : 5818418796118,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "total_count" : 880,
      "hit_count" : 0,
      "miss_count" : 880,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 2635,
      "memory" : "120.7mb",
      "memory_in_bytes" : 126598756,
      "terms_memory" : "27.9mb",
      "terms_memory_in_bytes" : 29360096,
      "stored_fields_memory" : "85.9mb",
      "stored_fields_memory_in_bytes" : 90156856,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.8mb",
      "norms_memory_in_bytes" : 4073472,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "2.8mb",
      "doc_values_memory_in_bytes" : 3008332,
      "index_writer_memory" : "96.4mb",
      "index_writer_memory_in_bytes" : 101155928,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "2.2kb",
      "fixed_bit_set_memory_in_bytes" : 2288,
      "max_unsafe_auto_id_timestamp" : 1632733058661,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 13,
          "index_count" : 2
        },
        {
          "name" : "boolean",
          "count" : 47,
          "index_count" : 7
        },
        {
          "name" : "date",
          "count" : 157,
          "index_count" : 35
        },
        {
          "name" : "flattened",
          "count" : 9,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 31,
          "index_count" : 3
        },
        {
          "name" : "keyword",
          "count" : 969,
          "index_count" : 33
        },
        {
          "name" : "long",
          "count" : 33,
          "index_count" : 10
        },
        {
          "name" : "nested",
          "count" : 16,
          "index_count" : 6
        },
        {
          "name" : "object",
          "count" : 252,
          "index_count" : 34
        },
        {
          "name" : "text",
          "count" : 682,
          "index_count" : 32
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 5,
      "ingest" : 5,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 10,
      "transform" : 5,
      "voting_only" : 0
    },
    "versions" : [
      "7.9.2"
    ],
    "os" : {
      "available_processors" : 25,
      "allocated_processors" : 25,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "425.8gb",
        "total_in_bytes" : 457232875520,
        "free" : "260.3gb",
        "free_in_bytes" : 279567822848,
        "used" : "165.4gb",
        "used_in_bytes" : 177665052672,
        "free_percent" : 61,
        "used_percent" : 39
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 143
      },
      "open_file_descriptors" : {
        "min" : 990,
        "max" : 1478,
        "avg" : 1183
      }
    },
    "jvm" : {
      "max_uptime" : "1.7h",
      "max_uptime_in_millis" : 6202109,
      "versions" : [
        {
          "version" : "15",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "50gb",
        "heap_used_in_bytes" : 53730895192,
        "heap_max" : "120gb",
        "heap_max_in_bytes" : 128849018880
      },
      "threads" : 522
    },
    "fs" : {
      "total" : "7.6tb",
      "total_in_bytes" : 8440701952000,
      "free" : "1.9tb",
      "free_in_bytes" : 2185250693120,
      "available" : "1.9tb",
      "available_in_bytes" : 2185082920960
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 10
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Your average shard size is 144GB; that's pretty large, and you should look at reducing it to around a third of that.
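One way to see the per-shard sizes behind that average is the standard cat shards API, sorted by store size:

```
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc
```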

Why do you have 5 masters? That's a weird number to go with.

And lastly, upgrade; 7.9 was released over 12 months ago, and 7.15 is the latest.

Hello Mark,

Thank you for your reply!

Okay. Let's do some maths here for the available options we have.

Option 1:
Current heap: 20 GB per node. (The recommended maximum is 20 shards per 1 GB of heap configured on the server, so that would make around 20 shards x 20 GB = 400 shards on the cluster. Am I right?)
Data per index: 300 GB, with daily index rotation.
Shard recommendation per index: amount of data in GB per index / 50 GB (a third of the current max shard size, as you suggested) = 300 GB / 50 GB = 6 shards per index.
Replicas: it's recommended to have 2 replicas for failover, but let's keep it to 1 replica.
Total shards per index: 6 primary shards + 1x replicas (another 6) = 12 shards per index.
Total shards on cluster: 20 indices (for 20 days' log retention) x 12 = 240 shards.

Option 2:
Current heap: 20 GB per node. (It's recommended that the maximum shard count be 20 shards per 1 GB of heap configured on the server, so that would make around 20 shards x 20 GB = 400 shards on the cluster. Am I right?)
Data per index: 300 GB, with daily index rotation.
Shard recommendation per index: amount of data in GB per index / 10 GB = 300 GB / 10 GB = 30 shards per index.
Replicas: 2 replicas.
Total shards per index: 30 primary shards + 2x replicas (another 60) = 90 shards per index.
Total shards on cluster: 20 indices (for 20 days' log retention) x 90 = 1800 shards.

This Option 2 total of 1800 shards exceeds the heap-based limit of 400 shards calculated above.

Can you please check whether my maths above is correct, or am I missing something? Can you also recommend the correct sizing for our workload?

You've got 5 data nodes, so 20 x 20 x 5 = 2000 across your entire cluster.
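Mark's point is that the shard budget is per data node, multiplied across data nodes. A quick sketch of the arithmetic in this thread (the 20-shards-per-GB-of-heap figure is the guideline quoted above):

```python
# Shard-count check for the two options discussed above.
SHARDS_PER_GB_HEAP = 20   # guideline: <= 20 shards per GB of heap
HEAP_GB_PER_NODE = 20     # current heap per data node
DATA_NODES = 5
RETAINED_INDICES = 20     # 20 days of daily indices

# The budget is per node, multiplied across data nodes -- not per cluster.
budget = SHARDS_PER_GB_HEAP * HEAP_GB_PER_NODE * DATA_NODES  # 2000

def total_shards(primaries: int, replicas: int) -> int:
    """Total shards across all retained indices (primaries plus replica copies)."""
    per_index = primaries * (1 + replicas)
    return RETAINED_INDICES * per_index

option1 = total_shards(primaries=6, replicas=1)   # 240
option2 = total_shards(primaries=30, replicas=2)  # 1800

print(budget, option1, option2)  # -> 2000 240 1800
```

With the budget computed correctly (2000, not 400), even Option 2's 1800 shards fits, though Option 1 leaves far more headroom.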

You should really use ILM here.

Can you suggest the correct number of shards and max shard size per index?

Yes, we have ILM set to rotate the index daily. Can you elaborate further on how to configure it?

Can you recommend the correct number of masters in our case?

It depends. But <50GB shard size is what we suggest.

3 should be sufficient.
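For reference, a rollover-based ILM policy along these lines would implement the <50GB-per-shard guidance. This is a sketch, not a drop-in config: the policy name is hypothetical, and `max_size` caps the *total* primary size, so with 6 primaries a 300gb cap keeps each shard near 50GB.

```
PUT _ilm/policy/k8s-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "300gb",
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "20d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

Note that rollover requires indexing through a write alias (or data stream) rather than directly into dated index names, and later 7.x releases also support a `max_primary_shard_size` rollover condition that targets shard size directly.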

Thank you for your guidance on this. I'll test it in our testing environment along with the version upgrade.

Can you please elaborate on the dependencies further? Should we roll the index over at a particular size, like 50GB, or in some other way?

Thanks, Mark, for guiding us in the right direction.