Elasticsearch Cluster Timeouts

Hello

It's my first topic, so I appreciate any help :slight_smile:

Scenario: I have Filebeat installed on multiple Windows machines sending data to 3 Logstash instances (Ubuntu servers). Filebeat is configured to spread the load evenly across all of them; this is the only way to achieve 100% load balancing without additional proxy servers.
The Logstash servers process the logs and send them to 5 Elasticsearch servers.
The Kibana server connects to an Elasticsearch instance running on localhost; this local instance acts as a coordinating-only node joined to the rest of the cluster.
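For reference, the node roles can be confirmed with the cat nodes API, e.g. something like this (the column selection is just what I would check, assuming the 5.x defaults):

curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu'
# the coordinating-only node on the Kibana host shows "-" in node.role, the data nodes show "mdi"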

Daily data volume is around 200 GB. I need to be able to search the data from Kibana for a maximum of 7 days. Note that the logs are split into 5 shards per index (0 replicas).
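If it helps, the per-index shard layout and size can be checked with something like this (the column list is just an example):

curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=index'
# lists each index with its primary shard count (pri), replica count (rep), doc count and size on disk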

elasticsearch:
"version" : {
"number" : "5.6.10",
"build_hash" : "b727a60",
"build_date" : "2018-06-06T15:48:34.860Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
}

logstash
version: 5.6.10

kibana
version: 5.6.10

Please note that all servers are on the same subnet (private network only).

For about a week now I have been receiving a lot of timeouts ("Timeout after 30000ms") and Bad Gateway errors, especially when searching over larger amounts of data (Elasticsearch fails; in the logs I can see timeouts).

Kibana: 8 GB RAM, dual-core CPU @ 2.30 GHz
ES: 16 GB RAM (JVM heap set to 8 GB), quad-core CPU
LS: 16 GB RAM (JVM heap set to 8 GB), quad-core CPU

Q: Is this environment not enough for 200 GB of logs per day? Should I add more ES nodes, or would some upgrade of the ELK servers solve it?
Thanks in advance.

Regards.

Can you provide the full output of the cluster stats API? This will give us an overview of the cluster and a better idea of what may be going wrong.

What type of storage do the Elasticsearch nodes have (local SSD, local spinning disk, SAN...)? What do disk I/O and performance look like when queries time out?
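For example, something like this, run on a data node while a slow query is in flight, would show whether the disks are the bottleneck (iostat is part of the sysstat package; device names will differ per host):

iostat -x 1 10
# watch %iowait in the avg-cpu section and %util / await on the Elasticsearch data volume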

Hello

Cluster stats API output when everything is running OK:

curl -s -XGET 'localhost:9200/_cluster/stats?human&pretty'
{
"_nodes" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"cluster_name" : "blabla",
"timestamp" : 1534312573747,
"status" : "green",
"indices" : {
"count" : 25,
"shards" : {
"total" : 99,
"primaries" : 94,
"replication" : 0.05319148936170213,
"index" : {
"shards" : {
"min" : 2,
"max" : 5,
"avg" : 3.96
},
"primaries" : {
"min" : 1,
"max" : 4,
"avg" : 3.76
},
"replication" : {
"min" : 0.0,
"max" : 4.0,
"avg" : 0.2
}
}
},
"docs" : {
"count" : 4476577857,
"deleted" : 3
},
"store" : {
"size" : "3.6tb",
"size_in_bytes" : 4035471747522,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "3.8gb",
"memory_size_in_bytes" : 4125258509,
"total_count" : 46946327,
"hit_count" : 13910459,
"miss_count" : 33035868,
"cache_size" : 64037,
"cache_count" : 579833,
"evictions" : 515796
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 2306,
"memory" : "11.7gb",
"memory_in_bytes" : 12617781947,
"terms_memory" : "10.4gb",
"terms_memory_in_bytes" : 11235140920,
"stored_fields_memory" : "959.3mb",
"stored_fields_memory_in_bytes" : 1005949768,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "747kb",
"norms_memory_in_bytes" : 764992,
"points_memory" : "88.9mb",
"points_memory_in_bytes" : 93283959,
"doc_values_memory" : "269.5mb",
"doc_values_memory_in_bytes" : 282642308,
"index_writer_memory" : "134.3mb",
"index_writer_memory_in_bytes" : 140901412,
"version_map_memory" : "2.8mb",
"version_map_memory_in_bytes" : 2940069,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1534309204962,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 6,
"data" : 5,
"coordinating_only" : 1,
"master" : 5,
"ingest" : 5
},
"versions" : [
"5.6.10"
],
"os" : {
"available_processors" : 22,
"allocated_processors" : 22,
"names" : [
{
"name" : "Linux",
"count" : 6
}
],
"mem" : {
"total" : "86.1gb",
"total_in_bytes" : 92499255296,
"free" : "1.8gb",
"free_in_bytes" : 1989488640,
"used" : "84.2gb",
"used_in_bytes" : 90509766656,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 123
},
"open_file_descriptors" : {
"min" : 365,
"max" : 490,
"avg" : 394
}
},
"jvm" : {
"max_uptime" : "5.7d",
"max_uptime_in_millis" : 494496965,
"versions" : [
{
"version" : "1.8.0_181",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.181-b13",
"vm_vendor" : "Oracle Corporation",
"count" : 6
}
],
"mem" : {
"heap_used" : "27.2gb",
"heap_used_in_bytes" : 29221230968,
"heap_max" : "44.8gb",
"heap_max_in_bytes" : 48126623744
},
"threads" : 400
},
"fs" : {
"total" : "6.2tb",
"total_in_bytes" : 6869140582400,
"free" : "2.3tb",
"free_in_bytes" : 2588563095552,
"available" : "2tb",
"available_in_bytes" : 2255513464832
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"netty4" : 6
},
"http_types" : {
"netty4" : 6

HDD: gp2 EBS volume (per ES server)
IOPS: 3900
/dev/xvdb 1.3T 805G 414G 67% /data

Hi
When ES is under load or I am getting Kibana timeouts, the output is:
curl -XGET http://localhost:9200/_cluster/stats?pretty
{
"_nodes" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"cluster_name" : "nameblabla",
"timestamp" : 1534319133347,
"status" : "green",
"indices" : {
"count" : 25,
"shards" : {
"total" : 99,
"primaries" : 94,
"replication" : 0.05319148936170213,
"index" : {
"shards" : {
"min" : 2,
"max" : 5,
"avg" : 3.96
},
"primaries" : {
"min" : 1,
"max" : 4,
"avg" : 3.76
},
"replication" : {
"min" : 0.0,
"max" : 4.0,
"avg" : 0.2
}
}
},
"docs" : {
"count" : 4530553380,
"deleted" : 3
},
"store" : {
"size_in_bytes" : 4081134819869,
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 159464,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 3830803181,
"total_count" : 45171282,
"hit_count" : 13292501,
"miss_count" : 31878781,
"cache_size" : 61488,
"cache_count" : 557974,
"evictions" : 496486
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 2396,
"memory_in_bytes" : 12778767580,
"terms_memory_in_bytes" : 11370192939,
"stored_fields_memory_in_bytes" : 1018267856,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 774528,
"points_memory_in_bytes" : 94411417,
"doc_values_memory_in_bytes" : 295120840,
"index_writer_memory_in_bytes" : 134809840,
"version_map_memory_in_bytes" : 2444625,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1534314054344,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 6,
"data" : 5,
"coordinating_only" : 1,
"master" : 5,
"ingest" : 5
},
"versions" : [
"5.6.10"
],
"os" : {
"available_processors" : 22,
"allocated_processors" : 22,
"names" : [
{
"name" : "Linux",
"count" : 6
}
],
"mem" : {
"total_in_bytes" : 92499255296,
"free_in_bytes" : 1839747072,
"used_in_bytes" : 90659508224,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 191
},
"open_file_descriptors" : {
"min" : 371,
"max" : 489,
"avg" : 397
}
},
"jvm" : {
"max_uptime_in_millis" : 501056508,
"versions" : [
{
"version" : "1.8.0_181",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.181-b13",
"vm_vendor" : "Oracle Corporation",
"count" : 6
}
],
"mem" : {
"heap_used_in_bytes" : 25125369160,
"heap_max_in_bytes" : 48126623744
},
"threads" : 414
},
"fs" : {
"total_in_bytes" : 6869140582400,
"free_in_bytes" : 2782896795648,
"available_in_bytes" : 2449847164928
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"netty4" : 6
},
"http_types" : {
"netty4" : 6

It times out when I run a 7-day query with graphs, for example. Please let me know if you need any extra details.

Regards

Hi

Should I stay on gp2 as the volume type and try to increase the max IOPS per volume, or should I upgrade to io1 (same volume size, but better IOPS/throughput)?

Thank you

Try to determine what is limiting performance when you are seeing timeouts. Is CPU pegged at 100%? Do you see a lot of iowait if you use iostat? Do you see anything in the Elasticsearch logs that could point to a problem, e.g. long or frequent GC?
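As a rough sketch, checks along these lines would cover all three questions (the log path assumes the default Ubuntu/Debian package location, so adjust it to your install):

top -bn1 | head -20
# is the CPU pegged, and is it user time or iowait?
iostat -x 1 5
# per-device %util and await while a slow query runs
grep -i 'JvmGcMonitorService' /var/log/elasticsearch/*.log | tail -20
# recent GC overhead messages; long or frequent collections show up here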

Hello

Please see the graphs from Kibana and the ES nodes when receiving timeouts (CPU is pegged at 100%):

[screenshots: Kibana/ES CPU usage, disk read/write on ES, and per-node I/O graphs for Elasticsearch 1 through 5]

Latest logs from Elasticsearch:

[2018-08-16T10:17:34,190][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][95351] overhead, spent [280ms] collecting in the last [1s]

[2018-08-16T10:39:12,924][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][96649] overhead, spent [281ms] collecting in the last [1s]

[2018-08-16T10:39:12,924][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][96649] overhead, spent [281ms] collecting in the last [1s]

[2018-08-16T11:10:08,651][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][98504] overhead, spent [287ms] collecting in the last [1s]

[2018-08-16T11:10:08,651][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][98504] overhead, spent [287ms] collecting in the last [1s]

[2018-08-16T11:24:17,021][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][99352] overhead, spent [302ms] collecting in the last [1s]

[2018-08-16T11:24:17,021][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][99352] overhead, spent [302ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticasia] [gc][99423] overhead, spent [288ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01] [gc][99427] overhead, spent [288ms] collecting in the last [1s]

[2018-08-16T11:25:32,037][INFO ][o.e.m.j.JvmGcMonitorService] [amaelkngki01/elasticeurope] [gc][99427] overhead, spent [288ms] collecting in the last [1s]

Thank you

Please do not post images of text when you can instead copy and paste it and format it properly so it is searchable. I cannot read any of those...

As CPU seems busy, it would be good to also get the output of the nodes hot threads API when the nodes are busy.
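For reference, the hot threads API can be queried like this (default parameters shown explicitly; run it while the cluster is busy):

curl -s 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'
# samples the busiest threads on every node over a short interval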

Hello Christian,

Please find the iostat output for the ES nodes when they are under load:

IOSTAT:

ES1:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          96.37   0.00    2.88    0.38   0.00   0.38

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  338.00 128.00 1063.00  13.11 120.80   230.26    22.71  19.07    1.22   21.22   0.73  87.40

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          97.25   0.00    2.00    0.12   0.00   0.62

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    4.50   1.50    2.00   0.04   0.03    40.00     0.01   1.71    4.00    0.00   0.57   0.20
xvdb       0.00  900.50 273.50  358.50  14.61   7.15    70.51     0.35   0.56    0.75    0.41   0.38  24.20

ES05:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          98.25   0.00    1.63    0.00   0.00   0.13

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  662.50 696.00  327.00  10.75   5.83    33.20     0.53   0.52    0.53    0.50   0.23  23.60

avg-cpu:  %user  %nice %system %iowait %steal  %idle
          96.74   0.00    2.63    0.00   0.00   0.63

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
loop0      0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00    0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
xvdb       0.00  869.50 492.50  435.00   8.01   6.98    33.10     0.60   0.64    0.82    0.45   0.27  24.80

[plus two more screenshots]

Nodes hot threads API response for every Elasticsearch server (the full output is over the upload limit here and no .txt attachments are allowed, so I'll just copy part of the response for each server):

ES1:

::: Hot threads at 2018-08-17T05:41:15.097Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

35.7% (178.4ms out of 500ms) cpu usage by thread 'elasticsearch[amaelknges01][search][T#7]'
3/10 snapshots sharing following 24 elements

35.1% (175.7ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#6]'
3/10 snapshots sharing following 22 elements

ES2:
61.2% (306.1ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#7]'
7/10 snapshots sharing following 21 elements
40.0% (200ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#2]'
10/10 snapshots sharing following 19 elements

ES3:
Hot threads at 2018-08-17T05:41:15.100Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

49.3% (246.2ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#3]'
3/10 snapshots sharing following 26 elements

43.6% (217.8ms out of 500ms) cpu usage by thread 'elasticsearch[search][T#1]'
2/10 snapshots sharing following 25 elements

Thank you

It seems like you are limited by CPU, and that it is searches/aggregations that are using the majority of it. How many concurrent queries are you running? What type of searches/aggregations are you running?
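One way to get a feel for the concurrent search load is the cat thread pool API, e.g. (the column list is just a suggestion):

curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
# active/queued/rejected search tasks per node; sustained queueing or rejections point to saturation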

Hi

If I want to break ES, it's enough to run the dashboard with graphs over 7 days; that generates enough load to break it (please note that one day of logs is between 160-200 GB and we store the logs for 7 days). Even in Discover, if I run a query spanning more than 2-3 days the load goes crazy.
Concurrent queries: let's say around 10, because a whole team is using it.
Should I consider upgrading the CPUs on the ES servers, or adding more ES nodes to the cluster?

Thank you

How many visualisations do you typically have on each dashboard?

It would appear your cluster does not have enough resources to process your queries, so upgrading the nodes or simply scaling out by adding additional nodes might be a good idea.

We use around 4 visualisations on each dashboard. If we limit the time range to the last 30 minutes we are good; if we go for 24 hours we get the timeout (7 days is out of the question :slight_smile: ).
I will consider upgrading the nodes, and if that does not help I will add additional ES nodes.
Thank you for your time; I'll post again after the changes.

4 visualisations per dashboard sounds quite normal. The reason I asked is that I have seen users add lots of visualisations per dashboard, which can generate a lot of load, as each of them results in a separate aggregation query.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.