Hi guys!
I need some help diagnosing my cluster performance.
Indexing is very slow and the whole cluster as well as Kibana is pretty unresponsive.
Here is my cluster:
We index about 1.5 TB, or roughly 1.5 billion events, every day.
We have multiple indices indexing at the same time, with most of the load originating from the logstash-* index (38 shards, no replicas).
Our cluster is split into two tiers.
What is the size of your documents? Do you use parent-child or nested documents? Do you perform any updates of existing data or just index new documents? Are you letting Elasticsearch assign the document id or providing one externally?
Maybe the codec should be set to something else? Compression costs CPU, I assume. I also suspect you have too many tasks getting scheduled back and forth.
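For reference, `index.codec` is a static index setting: the default (LZ4) favours indexing speed, while `best_compression` (DEFLATE) saves disk at the cost of extra CPU. Here is a minimal sketch to check what the logstash-* indices currently use, assuming Elasticsearch is reachable on localhost:9200:

```python
# Check which codec the logstash-* indices currently use.
# (localhost:9200 is an assumption; adjust to your cluster.)
import requests

resp = requests.get(
    "http://localhost:9200/logstash-*/_settings",
    params={"include_defaults": "true", "filter_path": "**.index.codec"},
)
resp.raise_for_status()
for index, body in resp.json().items():
    # The codec shows up under "settings" if set explicitly, otherwise under "defaults".
    explicit = body.get("settings", {}).get("index", {})
    defaults = body.get("defaults", {}).get("index", {})
    print(index, explicit.get("codec") or defaults.get("codec"))
```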
Our data consists mainly of log files (webserver, firewall, proxy, ...). Is there a way I can measure the average size of a document? (A rough estimate via the stats API is sketched at the end of this post.)
No updates are performed on existing data.
Document IDs are set automatically.
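For the average document size, one rough estimate is primary store size divided by document count from the index stats API. A minimal sketch, assuming Elasticsearch on localhost:9200 and the logstash-* pattern:

```python
# Rough average document size for logstash-*: primary store bytes / doc count.
# Because of compression this is the on-disk average, not the raw JSON size
# of an event coming out of Logstash.
import requests

resp = requests.get("http://localhost:9200/logstash-*/_stats/docs,store")
resp.raise_for_status()
primaries = resp.json()["_all"]["primaries"]
docs = primaries["docs"]["count"]
size_bytes = primaries["store"]["size_in_bytes"]
print(f"{docs} docs, {size_bytes / docs:.0f} bytes/doc on disk (average)")
```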
CPU should be good enough: 64 to 128 cores on every T1 machine.
Here is a CPU Load graph:
Good catch! server-0258 and server-0259 are the "indexing" nodes. They get the requests from logstash.
They have slightly fewer CPUs (24), and I have noticed server-0259 crashing several times a day without anything in the logs.
server-0258 is pretty busy with indexing, and on top of that it also gets requests like data/read/search and admin/mappings/get.
Maybe you should take a look at how these requests are routed.
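One way to see where those requests actually end up, and whether they queue there, is the cat thread_pool API. A minimal sketch comparing write and search pressure per node, assuming Elasticsearch on localhost:9200:

```python
# Compare indexing (write) and search thread-pool pressure per node.
# A large "queue" or non-zero "rejected" on server-0258/0259 would point
# at the ingest tier being the bottleneck.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/write,search",
    params={"h": "node_name,name,active,queue,rejected", "format": "json"},
)
resp.raise_for_status()
for row in resp.json():
    print(row["node_name"], row["name"],
          "active:", row["active"], "queue:", row["queue"], "rejected:", row["rejected"])
```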
Hmmm, is it possible that this server also hosts Kibana?
In the Kibana config you could try listing all of your nodes instead of just one. If only one host is configured, that node effectively acts as a load balancer/proxy for all Kibana traffic, and if that happens to be the overloaded ingest node, Kibana will be slow.
server-0258 and server-0259 are the "outward-facing" nodes. They take care of Kibana, REST queries and connection to logstash.
This never caused issues, but some time ago the cluster started performing badly.
I suspect some old setting is the cause.
Since the cluster has been upgraded from ES 0.* all the way to ES 6.8.1, it is almost inevitable that some deprecated setting persists somewhere.
In the next few days I will probably try to move Kibana to another server and test again.
EDIT: @wifi
I just changed the ES-hosts in the Kibana configs. Performance did not change.
Kibana 6 takes ages to load the Discover page when the logstash-* index is selected. I blame some sort of mild mapping explosion in the logstash-* index.
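One way to test the mapping-explosion theory is to count the mapped fields per index and compare against `index.mapping.total_fields.limit` (default 1000). A minimal sketch, assuming Elasticsearch on localhost:9200:

```python
# Count mapped fields per logstash-* index; a count in the thousands
# tends to make Kibana's Discover/field list noticeably sluggish.
import requests

def count_fields(properties):
    """Recursively count fields, including multi-fields and nested objects."""
    total = 0
    for field in properties.values():
        total += 1
        total += len(field.get("fields", {}))                # multi-fields like .keyword
        total += count_fields(field.get("properties", {}))   # object/nested sub-fields
    return total

resp = requests.get("http://localhost:9200/logstash-*/_mapping")
resp.raise_for_status()
for index, body in resp.json().items():
    for doc_type, mapping in body["mappings"].items():       # 6.x mappings still have a type level
        print(index, doc_type, count_fields(mapping.get("properties", {})))
```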
The bad performance you experience is only in Kibana, right? Maybe you can get some clues from the F12 developer tools in the browser; the network monitor in Chrome could be interesting. By the way, which browser do you use?
It's definitely not just in Kibana.
I just installed a separate server for Kibana and it's decently fast. As soon as ES is involved it gets slow. I tried IE, Firefox, Chrome, Edge and Brave.
Also, here's today's indexing queue:
(Different colors represent different nodes)
If only I could diagnose what is making these bulk requests so slow.
The nodes have fast SSDs, plenty of CPU, and a generous heap.
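To back up the "generous heap" part, the nodes stats API shows heap usage and old-gen GC per node; sustained heap pressure on the two ingest nodes would fit both the slow bulks and the unexplained crashes of server-0259. A minimal sketch, assuming Elasticsearch on localhost:9200:

```python
# Spot-check heap usage and old-gen GC per node. Sustained heap above ~75 %
# or long old-gen collection times are warning signs.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()
for node in resp.json()["nodes"].values():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(node["name"],
          "heap:", f'{jvm["mem"]["heap_used_percent"]}%',
          "old GC:", old_gc["collection_count"], "collections,",
          old_gc["collection_time_in_millis"], "ms total")
```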
EDIT:
Here is the graph after I changed the refresh interval from 30s to 180s:
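For reference, `index.refresh_interval` is a dynamic setting, so it can be changed on existing indices without a restart (new daily indices need the same value in their index template). A minimal sketch of that kind of settings change, assuming Elasticsearch on localhost:9200:

```python
# Raise the refresh interval on all existing logstash-* indices to 180 s.
# This takes effect immediately; fewer refreshes means fewer small segments
# and less merge pressure during heavy indexing.
import requests

resp = requests.put(
    "http://localhost:9200/logstash-*/_settings",
    json={"index": {"refresh_interval": "180s"}},
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```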