Monitoring -- interpretation of strange results

gtorrance · April 24, 2017, 5:09pm

Can anyone help me interpret what is going on with our server based on the following screenshots? We have a very simple setup: one ELK server on Linux and one Filebeat client on Windows. When we try to process a backlog of a few months of log files, the system essentially collapses under the weight (i.e., it becomes almost unresponsive, and Filebeat and Logstash generate constant errors because they cannot communicate with ES). There are only a few hundred files, each just a few hundred KB. Nothing major.

What do you make of the following?

Thanks,
Greg

gtorrance · April 24, 2017, 5:09pm

Meant to add another screenshot...

gtorrance · April 24, 2017, 5:14pm

The ELK server is running on a VMware vCloud server with 8 GB RAM, 4 CPUs, and 200 GB of disk.

gtorrance · April 24, 2017, 5:33pm

BTW, I believe the "stable" time from about 10:30 to 11:30 was when Filebeat had been turned off.

gtorrance · April 24, 2017, 6:21pm

I'm seeing some errors such as the following in the Elasticsearch logs:

[2017-04-24T13:48:35,029][INFO ][o.e.i.IndexingMemoryController] [FBp7aLX] now throttling indexing for shard [[tab-backgrounder-2016.12.25][2]]: segment writing can't keep up

gtorrance · April 24, 2017, 6:48pm

Also seeing quite a number of these:

[2017-04-24T12:39:38,989][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [FBp7aLX] failed to execute on node [FBp7aLX3STOJDfR9zTbeBA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [FBp7aLX][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [453052] timed out after [15014ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:916) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) [elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
[2017-04-24T12:39:38,989][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [FBp7aLX] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [FBp7aLX3STOJDfR9zTbeBA]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1032) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:915) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) [elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [FBp7aLX][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [453052] timed out after [15014ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:916) ~[elasticsearch-5.3.0.jar:5.3.0]
... 4 more

vinceh · April 24, 2017, 7:58pm

Hey Greg,

You can see in the first attachment that you are running out of heap. This will trigger an allocation failure and the super long GCs you're seeing.

In your jvm.options, give elastic more memory to work with:

vinceh · April 24, 2017, 8:20pm

Apparently email -> list got rid of my config items:

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms4g
-Xmx4g
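
A minimal sketch of the sizing rule behind those two lines, assuming the commonly cited Elastic guidance of giving the heap no more than half of physical RAM and keeping it below roughly 32 GB. The function names and the 31 GB cap are illustrative and not part of any Elasticsearch tooling. On the 8 GB VM described in this thread it produces the same 4 GB values; since Logstash and Kibana share that machine, the result is probably better read as an upper bound than as a target.

```python
def suggest_heap_gb(total_ram_gb: int, cap_gb: int = 31) -> int:
    """Return a heap size in GB: half of physical RAM, capped at cap_gb.

    The cap is an assumption that keeps the heap under the ~32 GB mark
    where the JVM stops using compressed object pointers.
    """
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM to suggest a heap size")
    return min(total_ram_gb // 2, cap_gb)


def jvm_heap_options(total_ram_gb: int) -> list:
    """Build the two jvm.options lines, with min and max heap set equal."""
    heap = suggest_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    # 8 GB matches the VM described earlier in this thread.
    for line in jvm_heap_options(8):
        print(line)
```

Each option has to start at the beginning of its line in jvm.options, so paste the output without leading spaces.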

gtorrance · April 25, 2017, 12:19pm

Thank you, Vince! I really appreciate the response.

I did as you suggested, and the situation seems to be much better. I'll post screenshots below. If you see anything further you'd recommend I adjust, please let me know.

BTW, I am now getting a flood of the following errors in my Logstash log. Seems to have to do with "rate limiting". No need to respond on this (unless you have some suggestions). I see others have posted similar questions, so I'm going to see what I can glean from that.

[2017-04-25T07:51:36,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@468fe101 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@52a47e68[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 745231]]"})
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Action

Thanks again,
Greg
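
For anyone reading the 429 line above: it reports the state of the Elasticsearch bulk thread pool at the moment the request was rejected. Here all 4 bulk threads were busy and the queue of 200 was full, so Elasticsearch pushed back and Logstash retried. The sketch below pulls those numbers out of such a log line so they can be tracked over time. It is only an illustration: the regular expression is written against the message format shown in this thread, and the function name and the "saturated" flag are assumptions of mine.

```python
import re

# Pattern written against the rejection message quoted in this thread;
# other Elasticsearch versions may format it differently.
REJECTION_RE = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)"
)


def parse_rejection(log_line: str):
    """Return thread pool figures from a rejection message, or None."""
    match = REJECTION_RE.search(log_line)
    if match is None:
        return None
    info = {"pool": match.group("pool")}
    for key in ("capacity", "pool_size", "active", "queued", "completed"):
        info[key] = int(match.group(key))
    # Hypothetical convenience flag: every thread busy and the queue full.
    info["saturated"] = (
        info["active"] >= info["pool_size"] and info["queued"] >= info["capacity"]
    )
    return info


if __name__ == "__main__":
    sample = (
        "rejected execution of org.elasticsearch.transport.TransportService$7@468fe101 "
        "on EsThreadPoolExecutor[bulk, queue capacity = 200, "
        "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@52a47e68"
        "[Running, pool size = 4, active threads = 4, queued tasks = 200, "
        "completed tasks = 745231]]"
    )
    print(parse_rejection(sample))
```

If the queue is full on most rejections, the indexing rate is likely higher than this single node can absorb, and slowing the senders down is usually a safer first step than raising the queue size.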
