upendra
(upendra pisupati)
October 12, 2016, 3:21pm
1
Hi,
Our Elasticsearch cluster and Kibana keep crashing when we execute reports. These are the product versions:
Logstash: 2.4.0
Elasticsearch: 2.4.0
Kibana: 4.6.1
Java: 1.8.0
The following is the error that we get:
Our cluster design is as follows:
Logstash Inputs: 4
Logstash output: 1
ES Master & Data: 5 ( Each one is both master and Data)
ES Client node (with Kibana): 1
The ELK cluster runs on CentOS 7, each node with 16 GB RAM, of which 4 GB is allotted to the ES_HEAP_SIZE parameter.
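For reference, the heap is set along the following lines (paths assumed for a standard RPM install of Elasticsearch 2.x on CentOS 7; a tarball install would export the variable instead):
# /etc/sysconfig/elasticsearch -- heap size picked up by the init/systemd scripts
ES_HEAP_SIZE=4g
# tarball install equivalent, before starting bin/elasticsearch:
# export ES_HEAP_SIZE=4g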
We have also tried setting the Node option in Kibana's startup script to:
exec "${NODE}" --max-old-space-size=100 "${DIR}/src/cli" ${@}
But still our Elasticsearch and Kibana keep crashing.
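As a sketch of what we tried, assuming the stock bin/kibana script still reads a NODE_OPTIONS environment variable, the same Node.js heap limit could also be passed without editing the script:
# hypothetical alternative: set the old-space limit (in MB) via the environment
NODE_OPTIONS="--max-old-space-size=100" bin/kibana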
Thanks,
Upendra
upendra
(upendra pisupati)
October 14, 2016, 10:55am
3
Hi Mark,
Running the command jmap -heap <pid> gives the following:
Attaching to process ID 15144, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.101-b13
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 8589934592 (8192.0MB)
NewSize = 348913664 (332.75MB)
MaxNewSize = 348913664 (332.75MB)
OldSize = 8241020928 (7859.25MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 314048512 (299.5MB)
used = 314048496 (299.49998474121094MB)
free = 16 (1.52587890625E-5MB)
99.99999490524573% used
Eden Space:
capacity = 279183360 (266.25MB)
used = 279183360 (266.25MB)
free = 0 (0.0MB)
100.0% used
From Space:
capacity = 34865152 (33.25MB)
used = 34865136 (33.24998474121094MB)
free = 16 (1.52587890625E-5MB)
99.99995410890507% used
To Space:
capacity = 34865152 (33.25MB)
used = 0 (0.0MB)
free = 34865152 (33.25MB)
0.0% used
concurrent mark-sweep generation:
capacity = 8241020928 (7859.25MB)
used = 8241020896 (7859.249969482422MB)
free = 32 (3.0517578125E-5MB)
99.9999996116986% used
15745 interned Strings occupying 2446000 bytes.
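If a full heap dump would help with the analysis, we can capture one for offline inspection along these lines (output path is just an example):
# dump live objects to a binary .hprof file for analysis in MAT or VisualVM
jmap -dump:live,format=b,file=/tmp/es-heap.hprof 15144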
Thanks,
Upendra
upendra
(upendra pisupati)
October 14, 2016, 11:01am
4
Hi Mark,
Please see the logs. I am unable to send you the complete logs due to the space constraints of this forum.
Thanks,
Upendra
spinscale
(Alexander Reelsen)
October 14, 2016, 11:21am
5
Hey,
please use gist or another pastebin to put the logs somewhere (also make sure they don't contain sensitive information), and keep the format as text. Thanks!
--Alex
upendra
(upendra pisupati)
October 14, 2016, 12:11pm
6
Thanks, Alex, for the help.
Please find the log entry here:
elasticsearch_error.txt
[2016-10-14 16:14:08,736][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392538]
[2016-10-14 16:14:08,740][WARN ][monitor.jvm ] [arlmselk02_M!D!] [gc][old][67284][32] duration [30s], collections [1]/[30.1s], total [30s]/[13.8m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [30.9mb]->[32.6mb]/[33.2mb]}{[old] [7.6gb]->[7.6gb]/[7.6gb]}
[2016-10-14 16:14:51,912][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392547]
[2016-10-14 16:15:22,077][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392532]
[2016-10-14 16:15:22,077][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392539]
[2016-10-14 16:15:51,794][WARN ][transport ] [arlmselk02_M!D!] Received response for a request that has timed out, sent [71197ms] ago, timed out [30355ms] ago, action [cluster:monitor/nodes/stats[n]], node [{arlmselk06_M$D$}{dbO25mC-SEq0zTAz8qC_2g}{192.168.xxx.xxx}{192.168.xxx.xxx:9300}], id [392528]
[2016-10-14 16:16:53,484][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392350]
[2016-10-14 16:16:53,485][WARN ][transport ] [arlmselk02_M!D!] Transport response handler not found of id [392533]
[2016-10-14 16:18:31,514][ERROR][watcher.input.http ] [arlmselk02_M!D!] failed to execute [http] input for [org.elasticsearch.watcher.watch.Watch@5efed351]
ElasticsearchTimeoutException[failed to execute http request. timeout expired]; nested: SocketTimeoutException[Read timed out];
(The attached log file has been truncated.)
Regards,
Upendra
spinscale
(Alexander Reelsen)
October 14, 2016, 12:29pm
7
If you read that log, you can spot an out-of-memory exception. This means you have to restart your node immediately, as the behaviour after such an exception is not specified (you simply don't know whether everything still works or not).
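On CentOS 7 that usually means something like the following, assuming a package install managed by systemd:
# restart the Elasticsearch service on the affected node
sudo systemctl restart elasticsearch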
However, in order to prevent these issues in the future, you should find out what triggers the exception. Is it a particular query?
You might want to read the following on that topic:
I have a 10 machine cluster where frequently (about once per day when indexing and querying is at its height) one elasticsearch node goes OOM... It usually recovers, but by this time the cluster is redistributing the lost shards, which causes more load, which often in turn causes an OOM on another machine. Each machine has 32GB memory of which I currently have 12GB allocated to Elasticsearch. I have logstash (max 500M) and redis (max 2GB) running on the machines too, and see that the rem…
You can use the cat APIs or monitoring to see if you have continuously rising memory usage or spikes that cause this behaviour.
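For example, a quick check along these lines (run against any node; column names as in the 2.x cat API):
# per-node heap and RAM usage
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
# fielddata is a common cause of steadily growing heap
curl -s 'localhost:9200/_cat/fielddata?v'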
Hope this helps.
--Alex
warkolm
(Mark Walkom)
October 15, 2016, 2:30am
8
You should definitely be using Marvel as well.
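As a rough sketch, the usual install steps for Marvel on a 2.x cluster look like this (exact commands may differ slightly by version):
# on every Elasticsearch node:
bin/plugin install license
bin/plugin install marvel-agent
# on the Kibana node:
bin/kibana plugin --install elasticsearch/marvel/latest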