How to improve performance in this case?

Hi there,

I've set up a single-node Elastic Stack instance with 48 GB RAM (24 GB heap) and 8 cores.
Everything was working fine until I actually started ingesting some real logs.
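For reference, the 24 GB heap is just the usual -Xms/-Xmx pair in config/jvm.options (a minimal sketch of that part of the setup):

# JVM heap: min and max pinned to the same value, half of the 48 GB of RAM
-Xms24g
-Xmx24g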

The Kibana Discover tab is now practically unusable. It runs the following query when I want to view a couple of hours of data:

{
  "profile": true, 
  "version": true,
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "min_doc_count": 1
      }
    }
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-08-22T04:10:26.899Z",
              "lte": "2019-08-22T07:10:26.899Z"
            }
          }
        }
      ],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}

Even this small time window results in the following response:

{
 "took" : 24603,
 "timed_out" : false,
 "_shards" : {
   "total" : 14,
   "successful" : 14,
   "skipped" : 0,
   "failed" : 0
 },
 "hits" : {
   "total" : {
     "value" : 10000,
     "relation" : "gte"
   },

The profile results tell me that two indexes take the longest:

Query: [profile UI screenshot not included]

Aggregation: [profile UI screenshot not included]

This is the system usage during the query:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  0   0  99   0   0|   0     0 | 186B  400B|   0     0 | 335   527 
 23   0  77   0   0|   0    84k|2110B  466B|   0     0 |1007   878 
 45   1  54   0   0|   0     0 |  66B  342B|   0     0 |1950  1612 
 44   2  55   0   0|   0   256k| 198B  474B|   0     0 |1979  1432 
 37   9  54   0   0|   0    12k|  66B  342B|   0     0 |2151  1437 
 17   0  83   0   0|   0     0 |  66B  458B|   0     0 | 785   743 
 20   1  79   0   0|   0   784k|  66B  342B|   0     0 | 951   878 
 17   1  81   0   0|   0    12k|  66B  458B|   0     0 | 923   818 
 27   0  72   0   0|   0   280k| 865B 1843B|   0     0 |1350  1092 
 18   1  75   7   0|   0   121M| 440B 1070B|   0     0 |1015   759 
 17   1  62  20   0|   0   177M| 276B  408B|  16k    0 |1040   910 
 20   1  73   7   0|  28k   72M| 132B  416B|  12k    0 |1120  1200 
 18   0  82   0   0|   0     0 |1482B 1496B|8192B    0 |1024  1005 
 18   0  82   0   0|  12k   88k|1960B   77k|4096B    0 | 978   910 
 17   0  83   0   0|   0   128k|  66B  416B|   0     0 | 768   698 
 19   0  81   0   0|   0     0 |  66B  342B|   0     0 | 764   669 
 18   0  82   0   0|   0    56k| 126B  458B|   0     0 | 891   838 
 17   0  83   0   0|   0     0 |  66B  342B|   0     0 | 821   760 
 23   0  77   0   0|   0   172k| 126B  342B|   0     0 |1014   918 
 17   0  83   0   0|   0    72k| 126B  400B|   0     0 | 740   660 
 19   0  81   0   0|   0    12k|  66B  400B|   0     0 | 783   669 
 17   0  83   0   0|   0     0 | 425B  639B|   0     0 | 686   585 
  8   0  91   0   0|4096B    0 | 867B   75k|4096B    0 | 705   776 
  0   0 100   0   0|   0  8192B| 324B 1810B|   0     0 | 297   474 

As the output above shows, there does not appear to be an obvious bottleneck: the CPU is mostly idle, and disk I/O is light apart from a brief burst of writes.

What can I do to improve performance? Building a cluster would obviously help, but what is the actual limiting factor here? It would be a waste to spend money on hardware without knowing what exactly to spend it on, or for what purpose. If collecting data for a date_histogram aggregation is simply slow and Kibana is unusable with a substantial volume of logs, that would also be good to know upfront.

I have seen other posts from people with similar issues, and none of them actually seemed to solve it. I hope I can, with your help :slight_smile:.

I am a bit confused after a quick read. Your query seems to take 28 seconds, but the date histogram aggregation took only two seconds.

Is it the same speed when profile and version are disabled?

Also, just for testing, can you remove the highlight/stored_fields/script_fields/docvalue_fields/_source blocks from your request, move the range query into the filter part of your query, and remove the match_all part there?

I have slightly edited my original post to better reflect the statistics, this time taken from the graphical profile UI.
I have tried the query without version and profile, and they do not seem to affect the outcome.
After trying this query (using your feedback):

{
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "min_doc_count": 1
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
          {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-08-22T04:10:26.899Z",
              "lte": "2019-08-22T07:10:26.899Z"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

The results seem to be:

[profile UI screenshot not included]

This still seems to be quite slow.
System stats during the above query:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  0   0 100   0   0|   0    20k|  66B  342B|   0     0 | 292   468 
  1   0  99   0   0|   0    56k| 342B  542B|   0     0 | 451   668 
  1   0  99   0   0|  12k    0 | 318B  474B|  12k    0 | 439   723 
  3   0  97   0   0|   0   168k| 192B  416B|   0     0 | 455   581 
 19   0  81   0   0|   0     0 |1669B  608B|   0     0 | 828   747 
 33   0  67   0   0|   0    20k|  66B  342B|   0     0 |1170   867 
 33   0  67   0   0|   0     0 |  66B  342B|   0     0 |1095   731 
 33   0  66   0   0|4096B    0 |  66B  342B|4096B    0 |1264   962 
 18   0  82   0   0|   0     0 |  66B  350B|   0     0 | 808   755 
 17   0  83   0   0|   0    68k| 132B  476B|   0     0 | 718   640 
 18   0  82   0   0|   0    20k| 216B  408B|   0     0 | 722   640 
 20   0  80   0   0|   0   388k| 200B  408B|   0     0 |1038   976 
 17   0  82   0   0|   0     0 | 132B  476B|   0     0 | 756   689 
 24   0  76   0   0|   0   176k| 414B  606B|   0     0 |1059   871 
 17   0  82   0   0|   0     0 | 192B  408B|   0     0 | 776   698 
 17   0  83   0   0|   0    40k| 132B  476B|   0     0 | 732   632 
 19   0  81   0   0|   0   168k| 216B  408B|   0     0 | 824   704 
 17   0  83   0   0|   0     0 |  66B  342B|   0     0 | 811   797 
 17   0  83   0   0|   0     0 |  66B  342B|   0     0 | 740   707 
 17   0  83   0   0|   0   116k|  66B  342B|   0     0 | 648   590 
 20   0  80   0   0|   0    44k| 126B  558B|   0     0 | 894   882 
 18   0  82   0   0|   0   252k| 454B  600B|   0     0 | 827   816 
 17   0  83   0   0|   0     0 | 437B  581B|   0     0 | 791   745 
 23   0  76   0   0|   0   168k|1122B 1851B|   0     0 | 998   876 
 17   0  83   0   0|   0     0 | 954B 1772B|   0     0 | 687   614 
  6   0  94   0   0|   0     0 | 660B   72k|   0     0 | 513   599 
  0   0 100   0   0|   0   864k|  66B  342B|   0     0 | 288   445 

Out of curiosity, how many documents are returned, how many buckets are created, and how many shards are being searched?

Using the query from my previous post (mind you, the time is lower here now that the data is cached; I also had to figure out how to get the total number of hits, see the sketch after the output below):

{
  "took" : 13569,
  "timed_out" : false,
  "_shards" : {
    "total" : 15,
    "successful" : 15,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 35023964,
      "relation" : "eq"
    },

These are the buckets:

"aggregations" : {
    "2" : {
      "buckets" : [
        {
          "key_as_string" : "2019-08-22T04:10:00.000Z",
          "key" : 1566447000000,
          "doc_count" : 1441671
        },
        {
          "key_as_string" : "2019-08-22T04:15:00.000Z",
          "key" : 1566447300000,
          "doc_count" : 1569684
        },
        {
          "key_as_string" : "2019-08-22T04:20:00.000Z",
          "key" : 1566447600000,
          "doc_count" : 1549136
        },
        {
          "key_as_string" : "2019-08-22T04:25:00.000Z",
          "key" : 1566447900000,
          "doc_count" : 1577624
        },
        {
          "key_as_string" : "2019-08-22T04:30:00.000Z",
          "key" : 1566448200000,
          "doc_count" : 1638346
        },
        {
          "key_as_string" : "2019-08-22T04:35:00.000Z",
          "key" : 1566448500000,
          "doc_count" : 1571258
        },
        {
          "key_as_string" : "2019-08-22T04:40:00.000Z",
          "key" : 1566448800000,
          "doc_count" : 1596874
        },
        {
          "key_as_string" : "2019-08-22T04:45:00.000Z",
          "key" : 1566449100000,
          "doc_count" : 1218978
        },
        {
          "key_as_string" : "2019-08-22T06:00:00.000Z",
          "key" : 1566453600000,
          "doc_count" : 1608587
        },
        {
          "key_as_string" : "2019-08-22T06:05:00.000Z",
          "key" : 1566453900000,
          "doc_count" : 1594095
        },
        {
          "key_as_string" : "2019-08-22T06:10:00.000Z",
          "key" : 1566454200000,
          "doc_count" : 1593331
        },
        {
          "key_as_string" : "2019-08-22T06:15:00.000Z",
          "key" : 1566454500000,
          "doc_count" : 1623843
        },
        {
          "key_as_string" : "2019-08-22T06:20:00.000Z",
          "key" : 1566454800000,
          "doc_count" : 1607052
        },
        {
          "key_as_string" : "2019-08-22T06:25:00.000Z",
          "key" : 1566455100000,
          "doc_count" : 1605554
        },
        {
          "key_as_string" : "2019-08-22T06:30:00.000Z",
          "key" : 1566455400000,
          "doc_count" : 1618581
        },
        {
          "key_as_string" : "2019-08-22T06:35:00.000Z",
          "key" : 1566455700000,
          "doc_count" : 1600485
        },
        {
          "key_as_string" : "2019-08-22T06:40:00.000Z",
          "key" : 1566456000000,
          "doc_count" : 1609865
        },
        {
          "key_as_string" : "2019-08-22T06:45:00.000Z",
          "key" : 1566456300000,
          "doc_count" : 1657128
        },
        {
          "key_as_string" : "2019-08-22T06:50:00.000Z",
          "key" : 1566456600000,
          "doc_count" : 1628745
        },
        {
          "key_as_string" : "2019-08-22T06:55:00.000Z",
          "key" : 1566456900000,
          "doc_count" : 1676143
        },
        {
          "key_as_string" : "2019-08-22T07:00:00.000Z",
          "key" : 1566457200000,
          "doc_count" : 1657574
        },
        {
          "key_as_string" : "2019-08-22T07:05:00.000Z",
          "key" : 1566457500000,
          "doc_count" : 1633426
        },
        {
          "key_as_string" : "2019-08-22T07:10:00.000Z",
          "key" : 1566457800000,
          "doc_count" : 145984
        }
      ]
    }
  }
}
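
For reference, getting the exact hit count (relation "eq" rather than the default 10,000 "gte" cap) is done by enabling track_total_hits; a minimal counting-only sketch, assuming it is simply added to the request body with the same range filter:

{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-08-22T04:10:26.899Z",
              "lte": "2019-08-22T07:10:26.899Z"
            }
          }
        }
      ]
    }
  }
}

By default, 7.x stops counting at 10,000 matching documents, which is why the very first response reported "value" : 10000 with "relation" : "gte"; asking for an exact count adds some cost of its own.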

Have a look at Slow date_histogram after upgrading to 7.3.0 on "dense" indexes.

We experienced exactly the same symptoms as you and could reproduce it with a sample dataset. The query profiler shows that the query phase is fast, but most of the time is spent in the collect phase.
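
In the profile response this is visible in the aggregation breakdown; a rough sketch of the relevant excerpt (aggregator name and timings are illustrative placeholders, the *_count fields are omitted):

"aggregations" : [
  {
    "type" : "DateHistogramAggregator",
    "description" : "2",
    "time_in_nanos" : 12100000000,
    "breakdown" : {
      "initialize" : 12000,
      "collect" : 12000000000,
      "build_aggregation" : 90000000,
      "reduce" : 0
    }
  }
]

Nearly all of the time sits in collect, i.e. in walking every matching document to assign it to a bucket, rather than in matching the range query itself.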


Yes, mine was slow in the collect phase as well. I'll check it out.

edit: I see nobody has an answer yet. However, I do believe my problem is different, because even if my collection time were divided by 2 or even by 10, it would still be slow. I just need to know what hardware to throw at the problem to solve it.
