Need help diagnosing slow indexing speed

Hi guys!
I need some help diagnosing my cluster performance.
Indexing is very slow, and both the cluster and Kibana are pretty unresponsive.
Here is my cluster:


We index about 1.5 TB, or roughly 1.5 billion events, every day.
We have multiple indices being written to at the same time, with most of the load originating from the logstash-* index (38 shards, no replicas).
Our cluster is split into two tiers.

  • T1: SSD, high CPU, high RAM
  • T2: HDD, medium CPU, high RAM

Here are the index settings for the logstash-* index:
https://gist.github.com/xenoid/9c705ee8b2a5f8b7989be312b887a2dd
This is the config of a typical T1 Data node:
https://gist.github.com/xenoid/c35d7d4a5f39f040abdaa1b5b4bd4282

A screenshot of the Monitoring page for the node:

A screenshot of the Monitoring page for the logstash-* index:

Here are a few minutes of logs from the data node:
https://gist.github.com/xenoid/2278a059b0500e22ad6bcaaa86da7874

Here you can see the bulk indexing queue of several nodes:

Thanks for reading this far.
Please hit me up if you need any more information.

What is the size of your documents? Do you use parent-child or nested documents? Do you perform any updates of existing data or just index new documents? Are you letting Elasticsearch assign the document id or providing one externally?

To have everything running smoothly, your system load needs to be below 1.

Have you set up some sort of ingest pipeline that performs heavy operations?
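If you are not sure, you can list what is registered and see how much time the ingest stage eats per node with something like:

    GET _ingest/pipeline
    GET _nodes/stats/ingest

If the list is empty or the ingest counters stay near zero, the pipelines are not your problem.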

How do you work with the data? If you mostly run aggregations, you might rather create a rollup index and query that instead of the raw data.
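Just as a sketch of what I mean — the job name, the field names (@timestamp, bytes) and the schedule are placeholders for whatever you actually aggregate on, and if I remember correctly the 6.x path still has the _xpack prefix (it is dropped in 7.x):

    PUT _xpack/rollup/job/logstash_hourly
    {
      "index_pattern": "logstash-*",
      "rollup_index": "logstash_rollup",
      "cron": "0 0 * * * ?",
      "page_size": 1000,
      "groups": {
        "date_histogram": {
          "field": "@timestamp",
          "interval": "1h"
        }
      },
      "metrics": [
        { "field": "bytes", "metrics": ["sum", "avg"] }
      ]
    }

After creating it you still have to start the job, and your dashboards would then query the rollup index via the rollup search API instead of the raw data.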

Maybe the codec should be set to something else? I assume that compression costs CPU. I also believe you have too many tasks being shuffled back and forth between nodes.
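You can see what the indices currently use with something like:

    GET logstash-*/_settings/index.codec

If nothing comes back, they are on the default codec. As far as I know the only options are "default" and "best_compression", and a change only applies to newly created indices or newly written segments.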

Can you run GET _tasks and post the result?
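Ideally something like this, so we can see which nodes the bulks pile up on:

    GET _tasks?detailed=true
    GET _tasks?actions=*bulk*&detailed=true

(detailed is a bit more expensive but includes a description of each task; the actions filter narrows it down to the bulk requests.)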

Our data consists mainly of log files (webserver, firewall, proxy, ...). Is there a way I can measure the exact size of an average document?
No updates on existing data are performed.
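(My own guess, if I am reading the _stats API right, would be to divide the on-disk size by the document count:

    GET logstash-*/_stats/store,docs

store.size_in_bytes / docs.count from the primaries section should at least give a rough, compressed-on-disk average. Please correct me if that is nonsense.)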
Document IDs are set automatically.
CPU should be good enough: 64 to 128 cores on every T1 machine.
Here is a CPU load graph:

No special ingest pipeline is set up. Processing is done in Logstash.

Here is the output of GET _tasks:

Looking at the tasks, I see two issues:

  1. The tasks are distributed unevenly across the nodes

  2. You have two machines that get a lot of indexing work:

    "dU-F_RcPRg63UHBBj3m53Q" : {
    "name" : "server-0258",
    "transport_address" : "10.2.0.236:9301",
    "host" : "10.2.0.236",
    "ip" : "10.2.0.236:9301",
    "roles" : [
      "ingest"
    ],
    "attributes" : {
      "xpack.installed" : "true",
      "zone" : "VM"
    },
    

and

   "LKveL5ccQHqXN2cP41U9wg" : {
    "name" : "server-0259",
    "transport_address" : "10.2.0.237:9301",
    "host" : "10.2.0.237",
    "ip" : "10.2.0.237:9301",
    "roles" : [
      "ingest"
    ],
    "attributes" : {
      "zone" : "VM",
      "xpack.installed" : "true"
    },

Maybe this helps somehow? I have no clue about Elasticsearch performance diagnosis and am just taking a wild guess.

Good catch! server-0258 and server-0259 are the "indexing" nodes. They get the requests from Logstash.
They have slightly fewer CPUs (24), but I have noticed server-0259 crashing several times a day without anything showing up in the log.

server-0258 is pretty busy with indexing, and on top of that it also gets requests like data/read/search and admin/mappings/get.

Maybe you should take a look at how these requests are routed.

Hmmm, is it possible that this server also hosts Kibana?

In the Kibana config you could try listing all of your nodes instead of just one. If you configure only a single server, it acts as a load balancer/proxy for all Kibana traffic, and if that happens to be the overloaded ingest node, Kibana will be slow.
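Something along these lines in kibana.yml — the addresses here are made up, point it at your data or coordinating nodes:

    elasticsearch.hosts:
      - "http://10.2.0.240:9200"
      - "http://10.2.0.241:9200"
      - "http://10.2.0.242:9200"

That way Kibana can spread its requests instead of funneling everything through server-0258.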

server-0258 and server-0259 are the "outward-facing" nodes. They take care of Kibana, REST queries and the connection to Logstash.
This never caused issues, but some time ago the cluster started performing badly.
I suspect some old setting as the cause.

As the cluster has been upgraded all the way from ES 0.* to now ES 6.8.1, it is inevitable that some deprecated setting persists somewhere.

But in the next few days I will probably try to move Kibana to another server and test again.

EDIT: @wifi
I just changed the ES hosts in the Kibana configs. Performance did not change.
Kibana 6 takes ages to load the Discover page when the logstash-* index is selected. I put the blame on some sort of mild mapping explosion in the logstash-* index.
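(To check that suspicion I was going to count the fields, roughly like this — no idea yet what a "healthy" number would be:

    GET logstash-*/_field_caps?fields=*

If the response lists thousands of fields I would take that as a hint.)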

What did you put into the Kibana setting elasticsearch.hosts?

I put a few of our fastest data nodes in there.
I also tried reducing the number of replicas on the .kibana index down to 5 from 44.
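(For the record, the replica change was just this — shout if that is the wrong way to do it:)

    PUT .kibana/_settings
    {
      "index": {
        "number_of_replicas": 5
      }
    }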
It did not change performance.

The bad performance you experience is only in Kibana, right? Maybe you can get some clues if you use the F12 developer tools in the browser. The network tab in Chrome could be interesting. BTW, which browser do you use?

It's definitely not just in Kibana.
I just installed a separate server for Kibana and it's decently fast. As soon as ES is involved it gets slow. I tried IE, Firefox, Chrome, Edge and Brave.
Also, here's today's indexing queue:


(Different colors represent different nodes)
If only I could diagnose what is making these bulk requests so slow.
The nodes have fast SSDs, big CPUs, and plenty of heap.
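The only half-useful check I have found so far is watching the write thread pool, in case anyone has a better idea:

    GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

queue and rejected are the columns I keep an eye on.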

EDIT:
Here is the graph after I changed the refresh interval from 30s to 180s:
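For reference, the change itself was basically just this (new daily indices would need the same value in the template as well):

    PUT logstash-*/_settings
    {
      "index": {
        "refresh_interval": "180s"
      }
    }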
