High Elasticsearch CPU usage causing slow searches and timeouts in Kibana

Hello Guys,

Hope you are all doing well!

We are using a Fluent-bit + Elasticsearch + Kibana stack for logging our Kubernetes containers. However, the CPU usage sometimes spikes very high and searches from Kibana stop working.

Setup information:

> Kibana & Elastic version - 7.9.2
> Elasticsearch hosts - 5 master and 5 data nodes, running in a separate namespace on the same Kubernetes cluster.
> Fluent-bit (1.7) - collects the logs.
> Storage - standard-disk physical volumes of 1.5 TB attached to each node, 7.5 TB in total.
> Number of indices - 20. (Fluent-bit gathers around 300 to 450 GB of Kubernetes logs daily from around 20 nodes. Logs are stored in a single date-wise index per day, and only the last 20 days of indices are kept.)
> Shards - 2 per index (20 primaries & 20 replicas across the 20 indices, plus a few system-generated ones).
> Total number of docs - 7,196,678,040 (around 359,833,902 per index).
> Elasticsearch usage: screenshot attached.
> Kibana memory usage - 470 MB / 1 GB
> Single index pattern with 60 fields.
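The hot-threads dump below is output from the nodes hot threads API; it can be reproduced with a request like the following (the parameters match the `interval=500ms` and `busiestThreads=3` values shown in the dump):

```
GET /_nodes/hot_threads?threads=3&interval=500ms
```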
::: {logging-es-masters-2}{zM6G5JfuR8eN5lXwC47oWA}{Xf0Vkm3ITqqvEnxfZFpvPQ}{10.32.30.13}{10.32.30.13:9300}{mr}{xpack.installed=true, transform.node=false}
   Hot threads at 2021-09-27T10:07:57.669Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.7% (3.2ms out of 500ms) cpu usage by thread 'elasticsearch[logging-es-masters-2][transport_worker][T#1]'
     2/10 snapshots sharing following 3 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)

::: {logging-es-data-0}{woah-pXsRNaSszo62ZxzCw}{RI4xPzumRqapnXK08yyOvA}{10.32.25.18}{10.32.25.18:9300}{dirt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-09-27T10:07:57.719Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   93.6% (467.9ms out of 500ms) cpu usage by thread 'elasticsearch[logging-es-data-0][generic][T#2]'
     2/10 snapshots sharing following 42 elements
       app//org.apache.lucene.search.ConjunctionDISI.doNext(ConjunctionDISI.java:200)
       app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:240)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:265)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender.lambda$executeChunkRequest$1(RecoverySourceHandler.java:790)
       app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender$$Lambda$5788/0x000000080193d620.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113)
       app//org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163)
       app//org.elasticsearch.action.ActionListener$6.onResponse(ActionListener.java:282)
       app//org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:136)
       app//org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1162)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1162)
       app//org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:213)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15/java.lang.Thread.run(Thread.java:832)
     5/10 snapshots sharing following 43 elements
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
00.02.05.19            50          10  98   21.23   17.47    13.52 mr        -      logging-es-masters-0
00.32.25.18            62          78  98   21.23   17.47    13.52 dirt      -      logging-es-data-0
00.02.05.19            23          10  41    3.04    2.86     3.15 mr        -      logging-es-masters-3
00.02.05.19            48          53  41    3.04    2.86     3.15 dirt      -      logging-es-data-2
00.02.05.19            26          87  34    1.04    1.51     1.76 dirt      -      logging-es-data-3
00.02.05.19            56          10  32    1.04    1.51     1.76 mr        -      logging-es-masters-2
00.02.05.19            44          11  12    0.57    0.57     0.81 mr        -      logging-es-masters-1
00.02.05.19            38          54  12    0.57    0.57     0.81 dirt      -      logging-es-data-1
00.02.05.19           60          10 100   25.23   19.94    13.10 mr        *      logging-es-masters-4
00.02.05.19           32          62 100   25.23   19.94    13.10 dirt      -      logging-es-data-4
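The node table above appears to come from the cat nodes API, e.g. a request along these lines (column selection via the standard `h` parameter):

```
GET /_cat/nodes?v&h=ip,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node.role,master,name
```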

cluster stats:

{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "logging",
  "cluster_uuid" : "wphMAxMBQ2290daQaxwWgA",
  "timestamp" : 1632737455914,
  "status" : "yellow",
  "indices" : {
    "count" : 36,
    "shards" : {
      "total" : 71,
      "primaries" : 36,
      "replication" : 0.9722222222222222,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 2,
          "avg" : 1.9722222222222223
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9722222222222222
        }
      }
    },
    "docs" : {
      "count" : 6745298896,
      "deleted" : 490704
    },
    "store" : {
      "size" : "5.2tb",
      "size_in_bytes" : 5818418796118,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "total_count" : 880,
      "hit_count" : 0,
      "miss_count" : 880,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 2635,
      "memory" : "120.7mb",
      "memory_in_bytes" : 126598756,
      "terms_memory" : "27.9mb",
      "terms_memory_in_bytes" : 29360096,
      "stored_fields_memory" : "85.9mb",
      "stored_fields_memory_in_bytes" : 90156856,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.8mb",
      "norms_memory_in_bytes" : 4073472,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "2.8mb",
      "doc_values_memory_in_bytes" : 3008332,
      "index_writer_memory" : "96.4mb",
      "index_writer_memory_in_bytes" : 101155928,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "2.2kb",
      "fixed_bit_set_memory_in_bytes" : 2288,
      "max_unsafe_auto_id_timestamp" : 1632733058661,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 13,
          "index_count" : 2
        },
        {
          "name" : "boolean",
          "count" : 47,
          "index_count" : 7
        },
        {
          "name" : "date",
          "count" : 157,
          "index_count" : 35
        },
        {
          "name" : "flattened",
          "count" : 9,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 31,
          "index_count" : 3
        },
        {
          "name" : "keyword",
          "count" : 969,
          "index_count" : 33
        },
        {
          "name" : "long",
          "count" : 33,
          "index_count" : 10
        },
        {
          "name" : "nested",
          "count" : 16,
          "index_count" : 6
        },
        {
          "name" : "object",
          "count" : 252,
          "index_count" : 34
        },
        {
          "name" : "text",
          "count" : 682,
          "index_count" : 32
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 5,
      "ingest" : 5,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 10,
      "transform" : 5,
      "voting_only" : 0
    },
    "versions" : [
      "7.9.2"
    ],
    "os" : {
      "available_processors" : 25,
      "allocated_processors" : 25,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "425.8gb",
        "total_in_bytes" : 457232875520,
        "free" : "260.3gb",
        "free_in_bytes" : 279567822848,
        "used" : "165.4gb",
        "used_in_bytes" : 177665052672,
        "free_percent" : 61,
        "used_percent" : 39
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 143
      },
      "open_file_descriptors" : {
        "min" : 990,
        "max" : 1478,
        "avg" : 1183
      }
    },
    "jvm" : {
      "max_uptime" : "1.7h",
      "max_uptime_in_millis" : 6202109,
      "versions" : [
        {
          "version" : "15",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "50gb",
        "heap_used_in_bytes" : 53730895192,
        "heap_max" : "120gb",
        "heap_max_in_bytes" : 128849018880
      },
      "threads" : 522
    },
    "fs" : {
      "total" : "7.6tb",
      "total_in_bytes" : 8440701952000,
      "free" : "1.9tb",
      "free_in_bytes" : 2185250693120,
      "available" : "1.9tb",
      "available_in_bytes" : 2185082920960
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 10
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Your average shard size is 144GB; that's pretty large, and you should look at reducing it to around a third of that.
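One way to see the per-shard sizes behind that average is the standard cat shards API, sorted by store size:

```
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc
```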

Why do you have 5 masters? That's a weird number to go with.

And lastly, upgrade; 7.9 was released over 12 months ago, and 7.15 is the latest.

Hello Mark,

Thank you for your reply!

Okay. Let's do some maths here for the available options we have.

Option 1:
Current heap: 20 GB per node. (The recommended maximum is 20 shards per 1 GB of heap configured on the server, so that would make around 20 shards x 20 GB = 400 shards on the cluster. Am I right?)
Data per index: 300 GB, with daily index rotation.
Shard recommendation per index: amount of data in GB per index / 50 GB (a third of the current max shard size, as you suggested) = 300 GB / 50 GB = 6 shards per index.
Replicas: it's recommended to have 2 replicas for failover, but let's keep it to 1 replica.
Total shards per index: 6 primary shards + 1x replicas (another 6) = 12 shards per index.
Total shards on cluster: 20 indices (for 20 days' log retention) x 12 = 240 shards.

Option 2:
Current heap: 20 GB per node. (It's recommended that the maximum shard count be 20 shards per 1 GB of heap configured on the server, so that would make around 20 shards x 20 GB = 400 shards on the cluster. Am I right?)
Data per index: 300 GB, with daily index rotation.
Shard recommendation per index: amount of data in GB per index / 10 GB = 300 GB / 10 GB = 30 shards per index.
Replicas: 2 replicas.
Total shards per index: 30 primary shards + 2x replicas (another 60) = 90 shards per index.
Total shards on cluster: 20 indices (for 20 days' log retention) x 90 = 1800 shards.

This Option 2 total of 1800 shards exceeds the heap-based limit of 400 shards calculated above.

Can you please check whether my maths above is correct, or am I missing something? Can you also recommend the correct sizing for our workload?

You've got 5 data nodes, so 20 x 20 x 5 = 2000 across your entire cluster.
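Mark's point is that the shard budget is per data node, multiplied across data nodes. A quick sketch of the arithmetic in this thread (the 20-shards-per-GB-of-heap figure is the guideline quoted above):

```python
# Shard-count check for the two options discussed above.
SHARDS_PER_GB_HEAP = 20   # guideline: <= 20 shards per GB of heap
HEAP_GB_PER_NODE = 20     # current heap per data node
DATA_NODES = 5
RETAINED_INDICES = 20     # 20 days of daily indices

# The budget is per node, multiplied across data nodes -- not per cluster.
budget = SHARDS_PER_GB_HEAP * HEAP_GB_PER_NODE * DATA_NODES  # 2000

def total_shards(primaries: int, replicas: int) -> int:
    """Total shards across all retained indices (primaries plus replica copies)."""
    per_index = primaries * (1 + replicas)
    return RETAINED_INDICES * per_index

option1 = total_shards(primaries=6, replicas=1)   # 240
option2 = total_shards(primaries=30, replicas=2)  # 1800

print(budget, option1, option2)  # -> 2000 240 1800
```

With the budget computed correctly (2000, not 400), even Option 2's 1800 shards fits, though Option 1 leaves far more headroom.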

You should really use ILM here.

Can you suggest the correct number of shards and max shard size per index?

Yes, we have ILM set to rotate the index daily. Can you elaborate further on how to configure it?

Can you recommend the correct number of masters in our case?

It depends. But <50GB shard size is what we suggest.

3 should be sufficient.
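For reference, a rollover-based ILM policy along these lines would implement the <50GB-per-shard guidance. This is a sketch, not a drop-in config: the policy name is hypothetical, and `max_size` caps the *total* primary size, so with 6 primaries a 300gb cap keeps each shard near 50GB.

```
PUT _ilm/policy/k8s-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "300gb",
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "20d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

Note that rollover requires indexing through a write alias (or data stream) rather than directly into dated index names, and later 7.x releases also support a `max_primary_shard_size` rollover condition that targets shard size directly.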

Thank you for your guidance on this. I'll test it in our testing environment along with the version upgrade.

Can you please elaborate on the dependencies further? Should we roll the index over at a particular size, like 50GB, or in some other way?

Thanks, Mark, for guiding us in the right direction.