Elasticsearch Cluster Timeouts

Hello

It's my first topic, so I appreciate any help :slight_smile:

Scenario: I have Filebeat installed on multiple Windows machines sending data to 3 Logstash instances (Ubuntu servers). Filebeat is configured to spread the load evenly across all of them; this is the only way to achieve 100% load balancing without additional proxy servers.
The Logstash servers process the logs and send them to 5 Elasticsearch servers.
The Kibana server connects to an Elasticsearch instance running on localhost; this local instance acts as a coordinating-only node joined to the rest of the cluster.
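For reference, the node roles can be confirmed with the cat nodes API, e.g. something like this (the column selection is just what I would check, assuming the 5.x defaults):

curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu'
# the coordinating-only node on the Kibana host shows "-" in node.role, the data nodes show "mdi"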

Daily data volume is around 200 GB. I need to be able to search the data from Kibana for a maximum of 7 days. Note that the logs are split into 5 shards per index (0 replicas).
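If it helps, the per-index shard layout and size can be checked with something like this (the column list is just an example):

curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=index'
# lists each index with its primary shard count (pri), replica count (rep), doc count and size on disk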

elasticsearch:
"version" : {
"number" : "5.6.10",
"build_hash" : "b727a60",
"build_date" : "2018-06-06T15:48:34.860Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
}

logstash
version: 5.6.10

kibana
version: 5.6.10

Please note that all servers are on the same subnet (private network only).

For about a week now I have been receiving a lot of timeouts ("Timeout after 30000ms") and Bad Gateway errors, especially when searching over larger amounts of data (Elasticsearch fails; in the logs I can see timeouts).

Kibana: 8 GB RAM, dual-core CPU @ 2.30 GHz
ES: 16 GB RAM (JVM heap set to 8 GB), quad-core CPU
LS: 16 GB RAM (JVM heap set to 8 GB), quad-core CPU

Q: Is this environment not enough for 200 GB of logs per day? Should I add more ES nodes, or would some upgrade of the ELK servers solve it?
Thanks in advance.

Regards.

Can you provide the full output of the cluster stats API? This will give us an overview of the cluster and a better idea of what may be going wrong.

What type of storage do the Elasticsearch nodes have (local SSD, local spinning disk, SAN...)? What do disk I/O and performance look like when queries time out?
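For example, something like this, run on a data node while a slow query is in flight, would show whether the disks are the bottleneck (iostat is part of the sysstat package; device names will differ per host):

iostat -x 1 10
# watch %iowait in the avg-cpu section and %util / await on the Elasticsearch data volume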

Hello

Cluster stats API output when everything is running OK:

curl -s -XGET 'localhost:9200/_cluster/stats?human&pretty'
{
"_nodes" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"cluster_name" : "blabla",
"timestamp" : 1534312573747,
"status" : "green",
"indices" : {
"count" : 25,
"shards" : {
"total" : 99,
"primaries" : 94,
"replication" : 0.05319148936170213,
"index" : {
"shards" : {
"min" : 2,
"max" : 5,
"avg" : 3.96
},
"primaries" : {
"min" : 1,
"max" : 4,
"avg" : 3.76
},
"replication" : {
"min" : 0.0,
"max" : 4.0,
"avg" : 0.2
}
}
},
"docs" : {
"count" : 4476577857,
"deleted" : 3
},
"store" : {
"size" : "3.6tb",
"size_in_bytes" : 4035471747522,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "3.8gb",
"memory_size_in_bytes" : 4125258509,
"total_count" : 46946327,
"hit_count" : 13910459,
"miss_count" : 33035868,
"cache_size" : 64037,
"cache_count" : 579833,
"evictions" : 515796
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 2306,
"memory" : "11.7gb",
"memory_in_bytes" : 12617781947,
"terms_memory" : "10.4gb",
"terms_memory_in_bytes" : 11235140920,
"stored_fields_memory" : "959.3mb",
"stored_fields_memory_in_bytes" : 1005949768,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "747kb",
"norms_memory_in_bytes" : 764992,
"points_memory" : "88.9mb",
"points_memory_in_bytes" : 93283959,
"doc_values_memory" : "269.5mb",
"doc_values_memory_in_bytes" : 282642308,
"index_writer_memory" : "134.3mb",
"index_writer_memory_in_bytes" : 140901412,
"version_map_memory" : "2.8mb",
"version_map_memory_in_bytes" : 2940069,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1534309204962,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 6,
"data" : 5,
"coordinating_only" : 1,
"master" : 5,
"ingest" : 5
},
"versions" : [
"5.6.10"
],
"os" : {
"available_processors" : 22,
"allocated_processors" : 22,
"names" : [
{
"name" : "Linux",
"count" : 6
}
],
"mem" : {
"total" : "86.1gb",
"total_in_bytes" : 92499255296,
"free" : "1.8gb",
"free_in_bytes" : 1989488640,
"used" : "84.2gb",
"used_in_bytes" : 90509766656,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 123
},
"open_file_descriptors" : {
"min" : 365,
"max" : 490,
"avg" : 394
}
},
"jvm" : {
"max_uptime" : "5.7d",
"max_uptime_in_millis" : 494496965,
"versions" : [
{
"version" : "1.8.0_181",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.181-b13",
"vm_vendor" : "Oracle Corporation",
"count" : 6
}
],
"mem" : {
"heap_used" : "27.2gb",
"heap_used_in_bytes" : 29221230968,
"heap_max" : "44.8gb",
"heap_max_in_bytes" : 48126623744
},
"threads" : 400
},
"fs" : {
"total" : "6.2tb",
"total_in_bytes" : 6869140582400,
"free" : "2.3tb",
"free_in_bytes" : 2588563095552,
"available" : "2tb",
"available_in_bytes" : 2255513464832
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"netty4" : 6
},
"http_types" : {
"netty4" : 6

HDD: gp2 EBS volume (per ES server)
IOPS: 3900
/dev/xvdb 1.3T 805G 414G 67% /data

Hi
When ES is under load or I am getting Kibana timeouts, the output is:
curl -XGET http://localhost:9200/_cluster/stats?pretty
{
"_nodes" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"cluster_name" : "nameblabla",
"timestamp" : 1534319133347,
"status" : "green",
"indices" : {
"count" : 25,
"shards" : {
"total" : 99,
"primaries" : 94,
"replication" : 0.05319148936170213,
"index" : {
"shards" : {
"min" : 2,
"max" : 5,
"avg" : 3.96
},
"primaries" : {
"min" : 1,
"max" : 4,
"avg" : 3.76
},
"replication" : {
"min" : 0.0,
"max" : 4.0,
"avg" : 0.2
}
}
},
"docs" : {
"count" : 4530553380,
"deleted" : 3
},
"store" : {
"size_in_bytes" : 4081134819869,
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 159464,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 3830803181,
"total_count" : 45171282,
"hit_count" : 13292501,
"miss_count" : 31878781,
"cache_size" : 61488,
"cache_count" : 557974,
"evictions" : 496486
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 2396,
"memory_in_bytes" : 12778767580,
"terms_memory_in_bytes" : 11370192939,
"stored_fields_memory_in_bytes" : 1018267856,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 774528,
"points_memory_in_bytes" : 94411417,
"doc_values_memory_in_bytes" : 295120840,
"index_writer_memory_in_bytes" : 134809840,
"version_map_memory_in_bytes" : 2444625,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1534314054344,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 6,
"data" : 5,
"coordinating_only" : 1,
"master" : 5,
"ingest" : 5
},
"versions" : [
"5.6.10"
],
"os" : {
"available_processors" : 22,
"allocated_processors" : 22,
"names" : [
{
"name" : "Linux",
"count" : 6
}
],
"mem" : {
"total_in_bytes" : 92499255296,
"free_in_bytes" : 1839747072,
"used_in_bytes" : 90659508224,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 191
},
"open_file_descriptors" : {
"min" : 371,
"max" : 489,
"avg" : 397
}
},
"jvm" : {
"max_uptime_in_millis" : 501056508,
"versions" : [
{
"version" : "1.8.0_181",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.181-b13",
"vm_vendor" : "Oracle Corporation",
"count" : 6
}
],
"mem" : {
"heap_used_in_bytes" : 25125369160,
"heap_max_in_bytes" : 48126623744
},
"threads" : 414
},
"fs" : {
"total_in_bytes" : 6869140582400,
"free_in_bytes" : 2782896795648,
"available_in_bytes" : 2449847164928
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"netty4" : 6
},
"http_types" : {
"netty4" : 6

It times out when I run a 7-day query with graphs, for example. Please let me know if you need any extra details.

Regards

Hi

Should I stay on gp2 as the volume type and try to increase the max IOPS per volume, or should I upgrade to io1 (same volume size, but better IOPS/throughput)?

Thank you

Try to determine what is limiting performance when you are seeing timeouts. Is CPU pegged at 100%? Do you see a lot of iowait if you use iostat? Do you see anything in the Elasticsearch logs that could point to a problem, e.g. long or frequent GC?
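As a rough sketch, checks along these lines would cover all three questions (the log path assumes the default Ubuntu/Debian package location, so adjust it to your install):

top -bn1 | head -20
# is the CPU pegged, and is it user time or iowait?
iostat -x 1 5
# per-device %util and await while a slow query runs
grep -i 'JvmGcMonitorService' /var/log/elasticsearch/*.log | tail -20
# recent GC overhead messages; long or frequent collections show up here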

Hello

Please see the graphs from Kibana and the ES nodes when receiving timeouts (CPU is pegged at 100%):

[screenshots: Kibana/ES CPU usage, disk read/write on ES, and per-node I/O graphs for Elasticsearch 1 through 5]

Latest logs from Elasticsearch:

[2018-08-16T10:17:34,190][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][95351] overhead, spent [280ms] collecting in the last [1s]

[2018-08-16T10:39:12,924][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][96649] overhead, spent [281ms] collecting in the last [1s]

[2018-08-16T10:39:12,924][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][96649] overhead, spent [281ms] collecting in the last [1s]

[2018-08-16T11:10:08,651][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][98504] overhead, spent [287ms] collecting in the last [1s]

[2018-08-16T11:10:08,651][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][98504] overhead, spent [287ms] collecting in the last [1s]

[2018-08-16T11:24:17,021][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][99352] overhead, spent [302ms] collecting in the last [1s]

[2018-08-16T11:24:17,021][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][99352] overhead, spent [302ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticasia] [gc][99423] overhead, spent [288ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][99427] overhead, spent [288ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][99427] overhead, spent [288ms] collecting in the last [1s]

Thank you

Please do not post images of text when you can instead copy and paste it and format it properly so it is searchable. I cannot read any of those...

As CPU seems busy, it would be good to also get the output of the nodes hot threads API when the nodes are busy.
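For reference, the hot threads API can be queried like this (default parameters shown explicitly; run it while the cluster is busy):

curl -s 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'
# samples the busiest threads on every node over a short interval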

Hello Christian,

Please find the iostat output for the ES nodes when they are under load:

IOSTAT:

ES1:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          96.37   0.00    2.88    0.38   0.00   0.38

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  338.00 128.00 1063.00  13.11 120.80   230.26    22.71  19.07    1.22   21.22   0.73  87.40

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          97.25   0.00    2.00    0.12   0.00   0.62

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    4.50   1.50    2.00   0.04   0.03    40.00     0.01   1.71    4.00    0.00   0.57   0.20
xvdb       0.00  900.50 273.50  358.50  14.61   7.15    70.51     0.35   0.56    0.75    0.41   0.38  24.20

ES05:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          98.25   0.00    1.63    0.00   0.00   0.13

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  662.50 696.00  327.00  10.75   5.83    33.20     0.53   0.52    0.53    0.50   0.23  23.60

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          96.74   0.00    2.63    0.00   0.00   0.63

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  869.50 492.50  435.00   8.01   6.98    33.10     0.60   0.64    0.82    0.45   0.27  24.80

[plus two more screenshots]

Nodes hot threads API response for every Elasticsearch server (the full output is over the upload limit here and no .txt attachments are allowed, so I'll just copy part of the response for each server):

ES1:

::: Hot threads at 2018-08-17T05:41:15.097Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

35.7% (178.4ms out of 500ms) cpu usage by thread 'elasticsearch[amaelknges01][search][T#7]'
3/10 snapshots sharing following 24 elements

35.1% (175.7ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#6]'
3/10 snapshots sharing following 22 elements

ES2:
61.2% (306.1ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#7]'
7/10 snapshots sharing following 21 elements
40.0% (200ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#2]'
10/10 snapshots sharing following 19 elements

ES3:
Hot threads at 2018-08-17T05:41:15.100Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

49.3% (246.2ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#3]'
3/10 snapshots sharing following 26 elements

43.6% (217.8ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#1]'
2/10 snapshots sharing following 25 elements

Thank you

It seems like you are limited by CPU, and that it is searches/aggregations that are using the majority of it. How many concurrent queries are you running? What type of searches/aggregations are you running?
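One way to get a feel for the concurrent search load is the cat thread pool API, e.g. (the column list is just a suggestion):

curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
# active/queued/rejected search tasks per node; sustained queueing or rejections point to saturation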

Hi

If I want to break ES, it's enough to run the dashboard with graphs over 7 days; that generates enough load to break it (please note that one day of logs is between 160-200 GB and we store the logs for 7 days). Even in Discover, if I run a query spanning more than 2-3 days the load goes crazy.
Concurrent queries: let's say around 10, because a whole team is using it.
Should I consider upgrading the CPUs on the ES servers, or adding more ES nodes to the cluster?

Thank you

How many visualisations do you typically have on each dashboard?

It would appear your cluster does not have enough resources to process your queries, so upgrading the nodes or simply scaling out by adding additional nodes might be a good idea.

We use around 4 visualisations on each dashboard. If we limit the time range to the last 30 minutes we are good; if we go for 24 hours we get the timeout (7 days is out of the question :slight_smile: ).
I will consider upgrading the nodes, and if that does not help I will add additional ES nodes.
Thank you for your time; I'll post again after the changes.

4 visualisations per dashboard sounds quite normal. The reason I asked is that I have seen users add lots of visualisations per dashboard, which can generate a lot of load, as each of them results in a separate aggregation query.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.