Hi there,
I've set up a single-node Elastic Stack instance with 48 GB RAM (24 GB heap) and 8 cores.
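For reference, the 24 GB heap is set via the usual -Xms/-Xmx options (shown here as a jvm.options sketch; the exact file location depends on the install):

# config/jvm.options (relevant lines only)
-Xms24g
-Xmx24g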
Everything was working fine until I actually started ingesting some real logs.
Kibana's Discover tab is now effectively unusable. It runs the following query when I want to see a couple of hours of data:
{
  "profile": true,
  "version": true,
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "min_doc_count": 1
      }
    }
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-08-22T04:10:26.899Z",
              "lte": "2019-08-22T07:10:26.899Z"
            }
          }
        }
      ],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}
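If it helps with debugging, this is how I run the same body outside Kibana with curl (the index pattern name below is just a placeholder for mine):

# query.json holds the exact body shown above; "logs-*" is a placeholder index pattern
curl -s -H 'Content-Type: application/json' \
  -XPOST 'http://localhost:9200/logs-*/_search?pretty' \
  --data-binary @query.json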
Even this small time window results in the following response:
{
  "took" : 24603,
  "timed_out" : false,
  "_shards" : {
    "total" : 14,
    "successful" : 14,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
  ...
The profile results tell me that two indices take the longest:
Query: [profile screenshot]
Aggregation: [profile screenshot]
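For reference, this is how I list the indices behind the index pattern, which is where the 14 shards in the response above should come from (the pattern is again a placeholder):

# show matching indices with primary/replica counts, doc counts and on-disk size
curl -s 'http://localhost:9200/_cat/indices/logs-*?v&h=index,pri,rep,docs.count,store.size'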
This is the system usage during the query:
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
0 0 99 0 0| 0 0 | 186B 400B| 0 0 | 335 527
23 0 77 0 0| 0 84k|2110B 466B| 0 0 |1007 878
45 1 54 0 0| 0 0 | 66B 342B| 0 0 |1950 1612
44 2 55 0 0| 0 256k| 198B 474B| 0 0 |1979 1432
37 9 54 0 0| 0 12k| 66B 342B| 0 0 |2151 1437
17 0 83 0 0| 0 0 | 66B 458B| 0 0 | 785 743
20 1 79 0 0| 0 784k| 66B 342B| 0 0 | 951 878
17 1 81 0 0| 0 12k| 66B 458B| 0 0 | 923 818
27 0 72 0 0| 0 280k| 865B 1843B| 0 0 |1350 1092
18 1 75 7 0| 0 121M| 440B 1070B| 0 0 |1015 759
17 1 62 20 0| 0 177M| 276B 408B| 16k 0 |1040 910
20 1 73 7 0| 28k 72M| 132B 416B| 12k 0 |1120 1200
18 0 82 0 0| 0 0 |1482B 1496B|8192B 0 |1024 1005
18 0 82 0 0| 12k 88k|1960B 77k|4096B 0 | 978 910
17 0 83 0 0| 0 128k| 66B 416B| 0 0 | 768 698
19 0 81 0 0| 0 0 | 66B 342B| 0 0 | 764 669
18 0 82 0 0| 0 56k| 126B 458B| 0 0 | 891 838
17 0 83 0 0| 0 0 | 66B 342B| 0 0 | 821 760
23 0 77 0 0| 0 172k| 126B 342B| 0 0 |1014 918
17 0 83 0 0| 0 72k| 126B 400B| 0 0 | 740 660
19 0 81 0 0| 0 12k| 66B 400B| 0 0 | 783 669
17 0 83 0 0| 0 0 | 425B 639B| 0 0 | 686 585
8 0 91 0 0|4096B 0 | 867B 75k|4096B 0 | 705 776
0 0 100 0 0| 0 8192B| 324B 1810B| 0 0 | 297 474
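If it is useful, I can also grab hot threads while the query is running, along these lines:

# take a few hot-threads snapshots during the slow Discover query
for i in 1 2 3; do
  curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5' >> hot_threads.txt
  sleep 2
done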
As the output above shows, there does not seem to be an obvious bottleneck: CPU never goes above roughly 45%, disk and network are mostly idle, and I/O wait only spikes briefly.
What can I do to improve performance? Obviously a cluster would help, but what is the actual limiting factor here? It would be a waste to spend money on hardware without knowing what it is for and what specifically to spend it on. If collecting data for a date_histogram aggregation is simply slow and Kibana is unusable with a substantial volume of logs, that is also good to know up front.
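To help narrow it down, I could also run just the date_histogram for the same time range, without the hits, sorting, stored fields, or highlighting, e.g.:

{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "format": "strict_date_optional_time",
        "gte": "2019-08-22T04:10:26.899Z",
        "lte": "2019-08-22T07:10:26.899Z"
      }
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "min_doc_count": 1
      }
    }
  }
}

If that is still slow, the aggregation itself is the problem; if not, it is the hit fetching or highlighting.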
I have seen other posts from people with similar issues, and none of them actually seemed to solve it. I hope I can, with your help.