Monitoring -- interpretation of strange results


(Greg T) #1

Can anyone help me interpret what is going on with our server based on the following screenshots? We have a very simple setup: one ELK server on Linux and one Filebeat client on Windows. When trying to process a backlog of a few months of log files, the system essentially collapses under the weight (i.e., it becomes almost unresponsive, and Filebeat and Logstash generate constant errors because they cannot communicate with ES). There are only a few hundred files, each just a few hundred KB. Nothing major.

What do you make of the following?

Thanks,
Greg


(Greg T) #2

Meant to add another screenshot...


(Greg T) #3

The ELK server is running on a VMware vCloud server with 8 GB of RAM, 4 CPUs, and 200 GB of disk.


(Greg T) #4

BTW, I believe the "stable" time from about 10:30 to 11:30 was when Filebeat had been turned off.


(Greg T) #5

I'm seeing some errors such as the following in the Elasticsearch logs:

[2017-04-24T13:48:35,029][INFO ][o.e.i.IndexingMemoryController] [FBp7aLX] now throttling indexing for shard [[tab-backgrounder-2016.12.25][2]]: segment writing can't keep up

(Greg T) #6

Also seeing quite a number of these:

[2017-04-24T12:39:38,989][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [FBp7aLX] failed to execute on node [FBp7aLX3STOJDfR9zTbeBA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [FBp7aLX][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [453052] timed out after [15014ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:916) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) [elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
[2017-04-24T12:39:38,989][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [FBp7aLX] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [FBp7aLX3STOJDfR9zTbeBA]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1032) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:915) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) [elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [FBp7aLX][127.0.0.1:9300][cluster:monitor/nodes/stats[n]] request_id [453052] timed out after [15014ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:916) ~[elasticsearch-5.3.0.jar:5.3.0]
... 4 more


(Vince) #7

Hey Greg,

You can see in the first attachment that you are running out of heap.
This will trigger an allocation failure and the super long GCs you're
seeing.

In your jvm.options, give Elasticsearch more memory to work with:


(Vince) #8

Apparently email -> list got rid of my config items:

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms4g
-Xmx4g
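
For context on where 4g comes from: the heap-size page linked in the comments above advises setting Xms and Xmx to the same value, giving Elasticsearch no more than about half of physical RAM, and staying below the roughly 32 GB compressed-pointer cutoff. A minimal Python sketch of that rule of thumb follows. The function names and the 31 GB cap are illustrative assumptions, not part of Elasticsearch.

```python
from typing import List


def suggest_heap_gb(ram_gb: int, cap_gb: int = 31) -> int:
    """Rule-of-thumb heap size in GB: half of RAM, capped below ~32 GB.

    cap_gb is an assumed safety margin, not an official constant.
    """
    if ram_gb < 2:
        raise ValueError("this rule of thumb needs at least 2 GB of RAM")
    return min(ram_gb // 2, cap_gb)


def jvm_options_lines(heap_gb: int) -> List[str]:
    """Build the two jvm.options lines, with min and max heap set equal."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


if __name__ == "__main__":
    heap = suggest_heap_gb(8)  # the 8 GB VM described earlier in the thread
    print("\n".join(jvm_options_lines(heap)))
```

On a machine that also hosts Logstash and Kibana, half of RAM may be too generous for Elasticsearch alone, so treat the result as a starting point to check against the monitoring graphs.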

(Greg T) #9

Thank you, Vince! I really appreciate the response.

I did as you suggested, and the situation seems to be much better. I'll post screenshots below. If you see anything further you'd recommend I adjust, please let me know.

BTW, I am now getting a flood of the following errors in my Logstash log. Seems to have to do with "rate limiting". No need to respond on this (unless you have some suggestions). I see others have posted similar questions, so I'm going to see what I can glean from that.

[2017-04-25T07:51:36,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@468fe101 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@52a47e68[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 745231]]"})
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Action

Thanks again,
Greg
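
A note on reading the 429 above: the message embeds the state of the bulk thread pool (4 threads, all active, and a 200-slot queue that is full), so Elasticsearch is rejecting bulk requests because they arrive faster than it can index them. Below is a small Python sketch that pulls those numbers out of such a line. The function name and the regex are illustrative assumptions based only on the log line quoted in this post, not a Logstash or Elasticsearch API.

```python
import re
from typing import Dict, Optional, Union

# Matches the thread pool status embedded in an es_rejected_execution_exception.
_POOL_RE = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+)"
)


def parse_rejection(line: str) -> Optional[Dict[str, Union[str, int, bool]]]:
    """Return the pool name and counters from a rejection message, or None."""
    m = _POOL_RE.search(line)
    if m is None:
        return None
    capacity = int(m.group("capacity"))
    queued = int(m.group("queued"))
    return {
        "pool": m.group("pool"),
        "capacity": capacity,
        "pool_size": int(m.group("pool_size")),
        "active": int(m.group("active")),
        "queued": queued,
        # True when the queue has no free slots, which is when rejections occur.
        "queue_full": queued >= capacity,
    }
```

Logstash retries these actions on its own, as the "retrying failed action" message shows, so the events are not necessarily lost. If the flood persists, reducing the indexing load (for example fewer Logstash pipeline workers or smaller batches) is one option to consider.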

