Simple queries take a lot of time and use 100% CPU

Hi!
I'm having an issue with filtering data gathered via Winlogbeat into ES - whenever I want to filter on a single host.name or winlog.computer_name (roughly the same thing) for just the last 24 hours, it takes more than 60,000 ms and shows an error; a few seconds later Kibana actually shows the data, but... why does it take so long? During query time all CPUs (24 logical, 12 physical cores on 2 sockets) are running at 100% usage.
RAM does not seem to be an issue - the node has 86GB of RAM, ES is assigned Xms/Xmx of 12GB, and the JVM heap currently in use is reported as 7GB. top/htop reports that the server uses 19-20GB of RAM in total.

Indices are created daily (winlogbeat-YYYY-MM-dd) with 1 shard and 0 replicas.

In total there are 418 indices, 448,496,267 documents and 938 shards.

Field mappings are taken from the default winlogbeat fields.yml.

How should I start troubleshooting this? Where should I look for the potential issue?
I believe queries like that should take a few seconds, not more... shouldn't they?

How many nodes do you have in the cluster? What type of storage are you using? What do disk utilisation and iowait look like when you are querying? How much data are you indexing? How long a time period do the indices in the cluster cover?

Thanks for your answer.
It's a single-node, full ELK stack. It uses internal SSD disks.
I'm indexing about 3-4GB of data daily (selected Windows logs from ~600 endpoints) - that's around 2-3 million docs per daily index.

Data is kept for 3 months only.

About disk util / iowait times - I'm going to verify that ASAP.

I don't know if dstat is the best tool, but below you can see the results from when the query was executed.

All 24 CPU threads are at almost 100%; the sdb1 disk (where the data is) had some reads, but I don't know if that's actually a lot for an SSD.

In addition - this query returned 0 results because I had mistyped the host.name - but it still took a long time to execute. Shouldn't such fields be something like keywords so they are indexed better?
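
For illustration, this is the kind of exact-match filter I have in mind (just a sketch run from Dev Tools, assuming winlog.computer_name is mapped as a keyword, as I believe the default winlogbeat template does):

GET winlogbeat-*/_search
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        { "term": { "winlog.computer_name": "PC01" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}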

What does your query and data look like?

Here's an example query:

{
  "version": true,
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30m",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    },
    {
      "field": "event.created",
      "format": "date_time"
    },
    {
      "field": "event.end",
      "format": "date_time"
    },
    {
      "field": "event.start",
      "format": "date_time"
    },
    {
      "field": "file.ctime",
      "format": "date_time"
    },
    {
      "field": "file.mtime",
      "format": "date_time"
    },
    {
      "field": "process.start",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "match_phrase": {
                        "winlog.computer_name": "PC01"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              },
              {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "winlog.event_id": 4688
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-11-02T16:22:09.610Z",
              "lte": "2019-11-03T16:22:09.610Z"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}

and a response (stripped to just one doc of ~1200):
https://pastebin.com/g9RPLErs

Which version are you using? If it's before 7.4.0 can you upgrade? It's possible that this is the performance issue fixed by #46670. If the issue persists after upgrading to 7.4.0 I suggest using the nodes hot threads API while the query is running to find out what Elasticsearch is busy doing.
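
For reference, a minimal call to that API from Dev Tools looks something like this (the query parameters are optional; the values here are just an illustration):

GET /_nodes/hot_threads?threads=3&interval=500ms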

My money is on the highlighter highlighting 500 results


Thanks for your answers :slight_smile:

@DavidTurner: I'm running version 7.4.0; the hint about nodes hot threads is nice, but to be honest I don't know how to read its output :frowning: Here it is, perhaps you can help me with another hint: https://pastebin.com/dQ4sGv6Q

@Mark_Harwood: indeed 500 results are highlighted with the filter values, but... isn't that the Kibana default? Should I override it with a smaller number? And after all... are 500 results that many? Daily I index 2-3 million events from 600 endpoints, and there are many cases where, for example, I need to list/export all processes started by a specific endpoint in a given time range - usually that's more than 500 per day. Can I list them another way / optimize such queries somehow?

Ok, we are seeing multiple threads all busy in the depths of Lucene. It doesn't look like it's busy highlighting (and highlighting 500 results is the usual behaviour in Kibana). Are you sure that all the fields listed in the docvalue_fields of that search actually have doc values enabled? If not, I think this could be quite expensive. Note this bit of the docs:

Note that if the fields parameter specifies fields without docvalues it will try to load the value from the fielddata cache causing the terms for that field to be loaded to memory (cached), which will result in more memory consumption.
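
One quick way to check is the field mapping API, e.g. something like this against your daily indices (the field list is taken from the docvalue_fields of your query):

GET winlogbeat-*/_mapping/field/@timestamp,event.created,event.end,event.start,file.ctime,file.mtime,process.start

Any field that shows "doc_values": false, or that is mapped as text, would not have doc values.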

@DavidTurner:
ok, I tried to read the part of the docs you provided, but... I feel like I'm walking in fog :wink:

Is doc_values enabled in the index mappings or somewhere else?
Below is a stripped-down part of the current index mapping showing how these fields are defined at the moment:

  "event": {
    "properties": {
      "created": {
        "type": "date"
      },
      "end": {
        "type": "date"
      },
		...
      },
      "start": {
        "type": "date"
      },
    }
  }
  
  
  
  "file": {
    "properties": {
      "ctime": {
        "type": "date"
      },
      "mtime": {
        "type": "date"
      },
    }
  },
  
  
   "process": {
    "properties": {
      "args": {
        "type": "keyword",
        "ignore_above": 1024
      },
      "start": {
        "type": "date"
      },
    }
  },

But I don't know if that's what I should be looking at. The docs you quoted state:

Note that if the fields parameter specifies fields without docvalues it will try to load the value from the fielddata cache causing the terms for that field to be loaded to memory (cached), which will result in more memory consumption.

Whereas what I notice is high CPU usage, while memory usage is quite low - 7-8GB of JVM heap (I would give it 10 times more if that helped, but the heap doesn't seem to be used that much... it's rather the CPU, for some reason).

I had the idea of using more shards for this index (I now have 5 instead of 1) so that the search runs in 5 threads in parallel instead of 1, but I don't know whether that actually helps on a single-node ES instance. Does it?

Quoting these docs:

All fields which support doc values have them enabled by default.

date fields support doc values and you haven't explicitly disabled them in the excerpt you quoted, so they should be enabled. Is that the case for the mapping of every index touched by the query that's running slowly?

What about the @timestamp field?

@timestamp is also defined as date:

"properties": {
  "@timestamp": {
    "type": "date"
  },

Just to be sure, I ran the same query against the last 24 hours only (two indices affected, same mappings) and it took 33,000 ms... a lot, for just 585 results in total out of 3GB of data.

It is interesting that the query time is so much lower than the request time. If I am reading the disk I/O stats you showed earlier correctly it looks like the disk is completely saturated. Exactly what type of disk are you using? Is it by any chance a spinning disk with a small SSD cache? Can you rerun the query and run iostat -x while it is running?

Can you also run the query and specify "size": 0 to see if that makes a difference?
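
For example, a cut-down version of the query you posted, keeping only the filters and with "size": 0 (run from Dev Tools against the same index pattern), would be something like:

GET winlogbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "winlog.computer_name": "PC01" } },
        { "match": { "winlog.event_id": 4688 } },
        { "range": { "@timestamp": { "gte": "2019-11-02T16:22:09.610Z", "lte": "2019-11-03T16:22:09.610Z" } } }
      ]
    }
  }
}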

The disk is a pure SSD - HPE 400GB 6G SAS MLC SFF (2.5in) ENT 3y Wty MO0400FBRWC Solid State Drive

About iostat, I ran it quite a few times since the query was executing for a very long time:

and the last one, when the query had already finished executing:

elasticsearch.yml paths:
path.data: /media/sdb/elasticsearch
path.logs: /var/log/elasticsearch

Then I tried to execute the query with "size": 0, but that's only possible from Dev Tools/Console, not from the Kibana Discover view - and that's the interesting part. It seems that queries which take 33,000 ms in the Discover window execute in 500-800 ms in Dev Tools/Console. When I added "size": 0, it went down to 100 ms.

Those stats look much better, so it does not necessarily look like a problem in Elasticsearch. How large is the payload returned to Kibana?

If you mean the size of the results, it's (run from Dev Tools/Console):

{
  "took" : 7023,
  "timed_out" : false,
  "_shards" : {
    "total" : 958,
    "successful" : 958,
    "skipped" : 940,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4088,
      "relation" : "eq"
    },

and the complete data size is 1,997KB.

The same query, for a different computer (so it's not cached), this time executed from the Discover window in Kibana:

The result was 2,693KB.

I have updated to 7.4.2 and rebooted the whole server, but it still behaves the same way, no improvement :confused:

Hi, for a faster highlighter you can try setting this in the field properties:

"term_vector" : "with_positions_offsets"

To be clear, are you saying that the search that comes from the Discover tab is taking over 30 seconds, but when you perform the exact same search from the Dev Console it takes less than 800 ms? Are these timings consistent, or is it only occasionally slow in the Discover tab? Is there any difference in what else is happening in the cluster when running these searches? E.g. do you only see it running slowly when there is also ongoing indexing?