Simple queries take a lot of time and use 100% CPU

Hi!
I'm having an issue with filtering data gathered via Winlogbeat into ES - whenever I want to filter on a single host.name or winlog.computer_name (roughly the same thing) for just the last 24 hours, it takes more than 60,000 ms and shows an error; a few seconds later Kibana actually shows the data, but... why does it take so long? During query time all CPUs (24 logical, 12 physical cores on 2 sockets) are running at 100% usage.
RAM does not seem to be an issue - the node has 86GB of RAM, ES is assigned Xms/Xmx of 12GB, and the JVM heap currently in use is reported as 7GB. top/htop reports that the server uses 19-20GB of RAM in total.

Indices are created daily (winlogbeat-YYYY-MM-dd) with 1 shard and 0 replicas.

In total there are 418 indices, 448,496,267 documents and 938 shards.

Field mappings are taken from the default winlogbeat fields.yml.

How should I start troubleshooting this? Where should I look for the potential issue?
I believe queries like that should take a few seconds, not more... shouldn't they?

How many nodes do you have in the cluster? What type of storage are you using? What do disk utilisation and iowait look like when you are querying? How much data are you indexing? How long a time period do the indices in the cluster cover?

Thanks for your answer.
It's a single-node, full ELK stack. It uses internal SSD disks.
I'm indexing about 3-4GB of data daily (selected Windows logs from ~600 endpoints) - that's around 2-3 million docs per daily index.

Data is kept for 3 months only.

About disk util / iowait times - I'm going to verify that ASAP.

I don't know if dstat is the best tool, but below you can see the results from when the query was executed.

All 24 CPU threads are at almost 100%; the sdb1 disk (where the data is) had some reads, but I don't know if that's actually a lot for an SSD.

In addition - this query returned 0 results because I had mistyped the host.name - but it still took a long time to execute. Shouldn't such fields be something like keywords so they are indexed better?
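
For illustration, this is the kind of exact-match filter I have in mind (just a sketch run from Dev Tools, assuming winlog.computer_name is mapped as a keyword, as I believe the default winlogbeat template does):

GET winlogbeat-*/_search
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        { "term": { "winlog.computer_name": "PC01" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}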

What does your query and data look like?

Here's an example query:

{
  "version": true,
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30m",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    },
    {
      "field": "event.created",
      "format": "date_time"
    },
    {
      "field": "event.end",
      "format": "date_time"
    },
    {
      "field": "event.start",
      "format": "date_time"
    },
    {
      "field": "file.ctime",
      "format": "date_time"
    },
    {
      "field": "file.mtime",
      "format": "date_time"
    },
    {
      "field": "process.start",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "match_phrase": {
                        "winlog.computer_name": "PC01"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              },
              {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "winlog.event_id": 4688
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-11-02T16:22:09.610Z",
              "lte": "2019-11-03T16:22:09.610Z"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}

and a response (stripped to just one doc of ~1200):
https://pastebin.com/g9RPLErs

Which version are you using? If it's before 7.4.0 can you upgrade? It's possible that this is the performance issue fixed by #46670. If the issue persists after upgrading to 7.4.0 I suggest using the nodes hot threads API while the query is running to find out what Elasticsearch is busy doing.
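
For reference, a minimal call to that API from Dev Tools looks something like this (the query parameters are optional; the values here are just an illustration):

GET /_nodes/hot_threads?threads=3&interval=500ms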

My money is on the highlighter highlighting 500 results


Thanks for your answers :slight_smile:

@DavidTurner: I'm running version 7.4.0; the hint about nodes hot threads is nice, but to be honest I don't know how to read its output :frowning: Here it is, perhaps you can help me with another hint: https://pastebin.com/dQ4sGv6Q

@Mark_Harwood: indeed 500 results are highlighted with the filter values, but... isn't that the Kibana default? Should I override it with a smaller number? And after all... are 500 results that many? Daily I index 2-3 million events from 600 endpoints, and there are many cases where, for example, I need to list/export all processes started by a specific endpoint in a given time range - usually that's more than 500 per day. Can I list them another way / optimize such queries somehow?

Ok, we are seeing multiple threads all busy in the depths of Lucene. It doesn't look like it's busy highlighting (and highlighting 500 results is the usual behaviour in Kibana). Are you sure that all the fields listed in the docvalue_fields of that search actually have doc values enabled? If not, I think this could be quite expensive. Note this bit of the docs:

Note that if the fields parameter specifies fields without docvalues it will try to load the value from the fielddata cache causing the terms for that field to be loaded to memory (cached), which will result in more memory consumption.
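
One quick way to check is the field mapping API, e.g. something like this against your daily indices (the field list is taken from the docvalue_fields of your query):

GET winlogbeat-*/_mapping/field/@timestamp,event.created,event.end,event.start,file.ctime,file.mtime,process.start

Any field that shows "doc_values": false, or that is mapped as text, would not have doc values.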

@DavidTurner:
ok, I tried to read the part of the docs you provided, but... I feel like I'm walking in fog :wink:

Is doc_values enabled in the index mappings or somewhere else?
Below is a stripped-down part of the current index mapping showing how these fields are defined at the moment:

  "event": {
    "properties": {
      "created": {
        "type": "date"
      },
      "end": {
        "type": "date"
      },
		...
      },
      "start": {
        "type": "date"
      },
    }
  }
  
  
  
  "file": {
    "properties": {
      "ctime": {
        "type": "date"
      },
      "mtime": {
        "type": "date"
      },
    }
  },
  
  
   "process": {
    "properties": {
      "args": {
        "type": "keyword",
        "ignore_above": 1024
      },
      "start": {
        "type": "date"
      },
    }
  },

But I don't know if that's what I should be looking at. The docs you quoted state:

Note that if the fields parameter specifies fields without docvalues it will try to load the value from the fielddata cache causing the terms for that field to be loaded to memory (cached), which will result in more memory consumption.

Whereas what I notice is high CPU usage, while memory usage is quite low - 7-8GB of JVM heap (I would give it 10 times more if that helped, but the heap doesn't seem to be used that much... it's rather the CPU, for some reason).

I had the idea of using more shards for this index (I now have 5 instead of 1) so that the search runs in 5 threads in parallel instead of 1, but I don't know whether that actually helps on a single-node ES instance. Does it?

Quoting these docs:

All fields which support doc values have them enabled by default.

date fields support doc values and you haven't explicitly disabled them in the excerpt you quoted, so they should be enabled. Is that the case for the mapping of every index touched by the query that's running slowly?

What about the @timestamp field?

@timestamp is also defined as date:

"properties": {
  "@timestamp": {
    "type": "date"
  },

Just to be sure, I ran the same query against the last 24 hours only (two indices affected, same mappings) and it took 33,000 ms... a lot, for just 585 results in total out of 3GB of data.

It is interesting that the query time is so much lower than the request time. If I am reading the disk I/O stats you showed earlier correctly it looks like the disk is completely saturated. Exactly what type of disk are you using? Is it by any chance a spinning disk with a small SSD cache? Can you rerun the query and run iostat -x while it is running?

Can you also run the query and specify "size": 0 to see if that makes a difference?
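
For example, a cut-down version of the query you posted, keeping only the filters and with "size": 0 (run from Dev Tools against the same index pattern), would be something like:

GET winlogbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "winlog.computer_name": "PC01" } },
        { "match": { "winlog.event_id": 4688 } },
        { "range": { "@timestamp": { "gte": "2019-11-02T16:22:09.610Z", "lte": "2019-11-03T16:22:09.610Z" } } }
      ]
    }
  }
}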

The disk is a pure SSD - HPE 400GB 6G SAS MLC SFF (2.5in) ENT 3y Wty MO0400FBRWC Solid State Drive

About iostat, I ran it quite a few times since the query was executing for a very long time:

and the last one, when the query had already finished executing:

elasticsearch.yml paths:
path.data: /media/sdb/elasticsearch
path.logs: /var/log/elasticsearch

Then I tried to execute the query with "size": 0, but that's only possible from Dev Tools/Console, not from the Kibana Discover view - and that's the interesting part. It seems that queries which take 33,000 ms in the Discover window execute in 500-800 ms in Dev Tools/Console. When I added "size": 0, it went down to 100 ms.

Those stats look much better, so it does not necessarily look like a problem in Elasticsearch. How large is the payload returned to Kibana?

If you mean the size of the results, it's (run from Dev Tools/Console):

{
  "took" : 7023,
  "timed_out" : false,
  "_shards" : {
    "total" : 958,
    "successful" : 958,
    "skipped" : 940,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4088,
      "relation" : "eq"
    },

and the complete data size is 1,997KB.

The same query, for a different computer (so it's not cached), this time executed from the Discover window in Kibana:

The result was 2,693KB.

I have updated to 7.4.2 and rebooted the whole server, but it still behaves the same way, no improvement :confused:

Hi, for a faster highlighter you can try setting this in the field properties:

"term_vector" : "with_positions_offsets"

To be clear, are you saying that the search that comes from the Discover tab is taking over 30 seconds, but when you perform the exact same search from the Dev Console it takes less than 800 ms? Are these timings consistent, or is it only occasionally slow in the Discover tab? Is there any difference in what else is happening in the cluster when running these searches? E.g. do you only see it running slowly when there is also ongoing indexing?