Kibana slowness and random errors

Hello Experts,

Kibana sometimes becomes very slow and keeps loading. I would also like to know whether our current Elasticsearch setup is correct, because it sometimes shows spikes in CPU and RAM and stops receiving logs from some nodes.

Setup information.

> Kibana & Elasticsearch version - 7.9.2
> Elasticsearch hosts - 5 master and 5 data nodes, running in a different namespace on the same Kubernetes cluster.
> Fluent Bit (1.7) - collects the logs
> Storage - standard-disk physical volumes of 1.5 TB attached to each node, 7.5 TB in total
> Number of indices - 20 (Fluent Bit gathers around 300 to 450 GB of Kubernetes logs daily from around 20 nodes. Logs are stored in a single index per day, and only the last 20 days of indices are kept.)
> Shards - 2 per index (20 primary and 20 replica for the 20 indices, plus a few system-generated ones)
> Total number of docs - 7196678040 (around 359833902 per index)
> Elasticsearch usage: screenshot attached.
> Kibana memory usage - 470 MB / 1 GB
> Single index pattern with 60 fields.

Kibana is sometimes too slow and returns this error:

Error: Not Found
    at Fetch._callee3$ (https://10.128.1.1:5601/33984/bundles/core/core.entry.js:34:109213)
    at l (https://10.128.1.1:5601/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155323)
    at Generator._invoke (https://10.128.1.1:5601/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155076)
    at Generator.forEach.e.<computed> [as next] (https://10.128.1.1:5601/33984/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155680)
    at fetch_asyncGeneratorStep (https://10.128.1.1:5601/33984/bundles/core/core.entry.js:34:102354)
    at _next (https://10.128.1.1:5601/33984/bundles/core/core.entry.js:34:102670)

Can you please advise on this?

Welcome to our community! :smiley:

Can you upgrade? 7.14 is the latest release.

What is the output from the _cluster/stats?pretty&human API?
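
For anyone following along, here is a minimal Python sketch of requesting that endpoint. The base URL and the basic-auth credentials are placeholders (assumptions, not values from this thread); only the `_cluster/stats?pretty&human` path comes from the question above.

```python
import base64
import json
import urllib.request


def build_stats_request(base_url, username=None, password=None):
    """Build a GET request for _cluster/stats with the pretty and human flags."""
    url = base_url.rstrip("/") + "/_cluster/stats?pretty&human"
    request = urllib.request.Request(url, method="GET")
    # Basic auth is an assumption; adapt this to however the cluster is secured.
    if username is not None and password is not None:
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        request.add_header("Authorization", "Basic " + token)
    return request


def fetch_cluster_stats(base_url, username=None, password=None, timeout=30):
    """Send the request and return the parsed JSON response as a dict."""
    request = build_stats_request(base_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_cluster_stats("https://localhost:9200", "elastic", "<password>")` would return the same structure that Kibana Dev Tools shows for `GET _cluster/stats`. If the cluster uses a self-signed certificate, the default TLS verification in `urlopen` will likely reject the connection, so a suitable SSL context would be needed.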

Hello Mark,

Thank you for your reply!

Sure, I'll read up on the upgrade steps for Kubernetes.

{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "logging",
  "cluster_uuid" : "wphMsdsAxMBQ229dss0daQaxwWgA",
  "timestamp" : 1630990752630,
  "status" : "green",
  "indices" : {
    "count" : 36,
    "shards" : {
      "total" : 72,
      "primaries" : 36,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 5141583304,
      "deleted" : 75571
    },
    "store" : {
      "size" : "4.3tb",
      "size_in_bytes" : 4799251378621,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "107.1mb",
      "memory_size_in_bytes" : 112379786,
      "total_count" : 3723370,
      "hit_count" : 105803,
      "miss_count" : 3617567,
      "cache_size" : 1396,
      "cache_count" : 17503,
      "evictions" : 16107
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 2224,
      "memory" : "97.9mb",
      "memory_in_bytes" : 102710640,
      "terms_memory" : "23.8mb",
      "terms_memory_in_bytes" : 25022016,
      "stored_fields_memory" : "68.3mb",
      "stored_fields_memory_in_bytes" : 71707824,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.3mb",
      "norms_memory_in_bytes" : 3470912,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "2.3mb",
      "doc_values_memory_in_bytes" : 2509888,
      "index_writer_memory" : "84.4mb",
      "index_writer_memory_in_bytes" : 88523952,
      "version_map_memory" : "1.1kb",
      "version_map_memory_in_bytes" : 1141,
      "fixed_bit_set" : "11.4kb",
      "fixed_bit_set_memory_in_bytes" : 11744,
      "max_unsafe_auto_id_timestamp" : 1630972801734,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 13,
          "index_count" : 2
        },
        {
          "name" : "boolean",
          "count" : 47,
          "index_count" : 7
        },
        {
          "name" : "date",
          "count" : 157,
          "index_count" : 35
        },
        {
          "name" : "flattened",
          "count" : 9,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 31,
          "index_count" : 3
        },
        {
          "name" : "keyword",
          "count" : 964,
          "index_count" : 33
        },
        {
          "name" : "long",
          "count" : 33,
          "index_count" : 10
        },
        {
          "name" : "nested",
          "count" : 16,
          "index_count" : 6
        },
        {
          "name" : "object",
          "count" : 252,
          "index_count" : 34
        },
        {
          "name" : "text",
          "count" : 677,
          "index_count" : 32
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 5,
      "ingest" : 5,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 10,
      "transform" : 5,
      "voting_only" : 0
    },
    "versions" : [
      "7.9.2"
    ],
    "os" : {
      "available_processors" : 20,
      "allocated_processors" : 20,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "425.8gb",
        "total_in_bytes" : 457257885696,
        "free" : "253.1gb",
        "free_in_bytes" : 271838982144,
        "used" : "172.6gb",
        "used_in_bytes" : 185418903552,
        "free_percent" : 59,
        "used_percent" : 41
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 23
      },
      "open_file_descriptors" : {
        "min" : 620,
        "max" : 1020,
        "avg" : 784
      }
    },
    "jvm" : {
      "max_uptime" : "57.7d",
      "max_uptime_in_millis" : 4992237504,
      "versions" : [
        {
          "version" : "15",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "41.8gb",
        "heap_used_in_bytes" : 44987517400,
        "heap_max" : "120gb",
        "heap_max_in_bytes" : 128849018880
      },
      "threads" : 510
    },
    "fs" : {
      "total" : "7.6tb",
      "total_in_bytes" : 8440701952000,
      "free" : "3.2tb",
      "free_in_bytes" : 3587225735168,
      "available" : "3.2tb",
      "available_in_bytes" : 3587057963008
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 10
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Sometimes the Elasticsearch index time also goes up to 2 to 3 s.

Thanks. There's nothing super obvious to me in all that; heap use is relatively low, you aren't oversharded, etc.
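
As a rough check of those observations, the sketch below recomputes a few ratios from the figures in the `_cluster/stats` output posted above. The dictionary keys and the `summarize` helper are names made up for this illustration; the numbers are copied from the output.

```python
# Figures copied from the _cluster/stats output earlier in this thread.
stats = {
    "shards_total": 72,
    "primaries": 36,
    "docs_count": 5141583304,
    "store_size_in_bytes": 4799251378621,
    "heap_used_in_bytes": 44987517400,
    "heap_max_in_bytes": 128849018880,
    "query_cache_hit_count": 105803,
    "query_cache_total_count": 3723370,
}

GIB = 1024 ** 3


def summarize(s):
    """Derive cluster-wide ratios from the raw counters (illustrative helper)."""
    return {
        "heap_used_pct": 100 * s["heap_used_in_bytes"] / s["heap_max_in_bytes"],
        "avg_shard_size_gib": s["store_size_in_bytes"] / s["shards_total"] / GIB,
        "avg_docs_per_primary": s["docs_count"] / s["primaries"],
        "query_cache_hit_pct": 100 * s["query_cache_hit_count"] / s["query_cache_total_count"],
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Heap use works out to roughly 35% of the maximum, and the average shard is around 62 GiB. That average covers all 36 indices, including the small system-generated ones, so the shards of the daily log indices are probably noticeably larger than the average suggests.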

Are you using the inbuilt Monitoring functionality?


Thank you for reviewing it, Mark :slight_smile:

However, the Kibana Discover section frequently shows the following symptoms:

1] It throws the error above after taking too long to return 12 or even 24 hours of logs.
2] Results are returned only after a long time (after we click "Run query beyond timeout").
3] Sometimes we need to log out of Kibana, wait for some time, and try again. It then starts working after a few attempts.

So, could you suggest some quick debug steps that I should check whenever we see any of these behaviours?

No, we are not using the built-in Monitoring. We use an elasticsearch exporter deployment with Prometheus and Grafana instead. Most of the alerts we receive are for Elasticsearch index time; we have set that alert threshold to 1s.

Hi,

Any inputs on this?

Any direction to debug it further would be greatly appreciated :slight_smile:
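
One possible starting point for the debug steps asked about above, offered as an assumption rather than a confirmed diagnosis: when Discover is slow, check whether any data node's search thread pool is queueing or rejecting requests. The sketch below filters rows shaped like the JSON output of `GET _cat/thread_pool/search?format=json&h=node_name,name,active,queue,rejected`. The function name, the threshold parameter, and the sample node names are hypothetical.

```python
def nodes_with_search_pressure(rows, queue_threshold=0):
    """Return names of nodes whose search pool has rejections or a queue above the threshold."""
    flagged = []
    for row in rows:
        if row.get("name") != "search":
            continue
        # The _cat APIs return numeric columns as strings when format=json is used.
        rejected = int(row.get("rejected", 0))
        queue = int(row.get("queue", 0))
        if rejected > 0 or queue > queue_threshold:
            flagged.append(row["node_name"])
    return flagged


# Hypothetical sample rows, not taken from the cluster in this thread.
sample_rows = [
    {"node_name": "data-0", "name": "search", "active": "4", "queue": "0", "rejected": "0"},
    {"node_name": "data-1", "name": "search", "active": "7", "queue": "120", "rejected": "0"},
    {"node_name": "data-2", "name": "search", "active": "7", "queue": "0", "rejected": "35"},
]

print(nodes_with_search_pressure(sample_rows))
```

If nodes show up here while Kibana is slow, `GET _nodes/hot_threads` and `GET _tasks?detailed=true&actions=*search*` on the same cluster can help narrow down which queries are occupying the search threads.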