GC overhead, spent [] collecting in the last [], causing crashes

I'm getting this a lot:

Which I think means something is wrong with my resource allocation. The VM has 64 GB of memory, so that should be plenty. The jvm.options file has -Xms16g and -Xmx16g.
What else is relevant here? There is something about shards and indices I'm not really understanding.
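If it helps, I can check what heap the node actually picked up with something like the cat nodes API (this is a sketch, I'm just guessing these are the right columns to ask for):

    curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max"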

I tried reading this https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster, but I just don't really understand what I'm dealing with yet. I followed the "getting started" guide, and a couple of days later this happened. I'm not sure how many shards or indices I have. There should not be a large amount of data; I have had 2 Beats running for a couple of days on stale devices, so I can't imagine space usage being the issue.

I think this is crashing my Logstash as well, because it dies with a "connection reset by peer" error.

Could this have something to do with an Elasticsearch template? I just followed the "getting started" guide, and it said nothing about templates. I think maybe the default one allocates only 1 GB of memory? How do I change this? Do I have to change it in the Beats yml? I can find an Elasticsearch template setting there; see the snippet below.
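This is the setting I mean, copied roughly from memory out of the beat's yml, so the exact keys and values are an assumption on my part:

    setup.template.settings:
      index.number_of_shards: 1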

Hey,

This basically means that Elasticsearch (to be more exact, the JVM) is busy cleaning up its memory instead of doing regular work. The main question is what needs to be done with 16 GB of memory (which is quite a bit).

Do you have a lot of shards? Are you running complex queries or aggregations? What is the load on your cluster? You can also enable monitoring to gather some stats about your cluster and check where your memory goes; a sketch of how to switch it on is below.
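For a 7.x cluster, self-monitoring collection can usually be enabled through the cluster settings API, roughly like this (treat it as a sketch; if you have security enabled you will need credentials):

    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "xpack.monitoring.collection.enabled": true
      }
    }'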

1 Like

The machine running the ELK stack has 64 GB of RAM, and the heap is set to 16 GB. I only have 1 other machine feeding it logs, so I feel like I should have more than enough resources. I followed a "getting started" guide that said nothing about templates or anything, so I think I'm using default settings other than having changed the heap size in jvm.options. So I'm thinking maybe I only have 1 shard and that is not enough? My Logstash crashes because the connection has been reset by peer.

Also, I'm having some terminology issues.
A node is my Elasticsearch instance, right?
A cluster is a bunch of Elasticsearch instances, connected so you can use resources from multiple devices, right?
A shard is some kind of data allocation "bucket", right?

Kind regards

Yes, your assumptions are pretty much correct.

node == instance of Elasticsearch
cluster == nodes working together
shard == unit of scale, internally a lucene index
index == collection of shards, a container for data
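To make the index/shard relationship concrete, here is a sketch of creating an index with an explicit shard layout (the index name and counts are just illustrative):

    curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'

Each primary shard can also have replica copies, which is why cat shards may list two entries per shard (primary plus replica).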

1 Like

Okay. So what do you think is the issue here? When I start my Elasticsearch, it starts doing that thing. I feel like it should have plenty of resources. Maybe it only starts when I open Kibana. I'm using it for SIEM.

The first thing is to figure out what triggers this: is it happening when indexing data, or when querying data? A first try could be to not use Kibana, keep indexing data, and see what happens.

Also, see my earlier post regarding enabling monitoring to get more insights...

1 Like

I think it happens when querying data. It feels like it breaks every time I open the browser.
How do I enable the monitoring?

See https://www.elastic.co/guide/en/elasticsearch/reference/7.6/monitor-elasticsearch-cluster.html

Can you provide more information about what kind of queries you are executing, and against what number of shards? Also, when this happens, can you collect hot threads and share the output in a gist/pastebin?
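Hot threads can be collected with the nodes hot threads API, for example:

    curl -X GET "localhost:9200/_nodes/hot_threads" > hot_threads.txt

Run it while the GC overhead messages are appearing, so the output reflects the problematic load.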

1 Like

I started Elasticsearch, then Logstash, then Metricbeat. When I started Metricbeat, the Elasticsearch log started showing update_mapping messages and then some crashes. You should be able to see the output of all the things here, although it's maybe not the easiest way to go through it.

I did not start Kibana, but it did say "overhead, spent..." before the Logstash update_mapping started.

metricbeat paste:
https://pastebin.com/Ehv1cSr4

Unfortunately I can't really tell you anything about the shards; I'm not sure how to check this. I'm using default configs for the most part, so maybe just 1 shard?

please do copy & paste screenshots, but take your time and put this into proper formatted text snippets. This is super hard to read.

One more question: does that GC overhead message keep occurring, or does it only occur once?

You can use the cat shards API to share shard information.
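For example (the v parameter just adds column headers):

    curl -X GET "localhost:9200/_cat/shards?v"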

1 Like

Yeah, I understand that was less than optimal, but it was an attempt to show what was happening at the same time in multiple terminals.
With Elasticsearch and Logstash started, I ran curl -X GET "localhost:9200/_cat/shards?pretty" and got this returned:


Do I have 2 of every shard? Is it supposed to be like that?

It doesn't look like anything is hoarding resources, but I don't have Winlogbeat and Kibana running, and Elasticsearch isn't currently complaining about GC overhead.

I will try again with both running.

***EDIT

Okay, I tried running everything again (Elasticsearch, Logstash, Kibana, Winlogbeat) and am now back to my original error, where Logstash crashes with this error:


Elasticsearch isn't complaining about GC overhead, though.

Those are still screenshots and not text. This makes it hard to read on mobile, and impossible on non-screen devices.

So, do you have any errors on the Elasticsearch side now, or just Logstash? Again, sharing more data would be incredibly helpful instead of snippets in images.

Ahh okay, I'll copy/paste the text from now on. Currently Elasticsearch isn't giving the real error, only Logstash. Do you want me to paste the error, or can you see it in the picture? What more data can I provide you?
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.headius.backport9.modules.Modules (file:/home/siem/logstash-7.6.1/logstash-core/lib/jars/jruby-complete-9.2.9.0.jar) to method sun.nio.ch.NativeThread.signal(long)
WARNING: Please consider reporting this to the maintainers of com.headius.backport9.modules.Modules
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Sending Logstash logs to /home/siem/logstash-7.6.1/logs which is now configured via log4j2.properties
[2020-03-20T14:29:40,514][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2020-03-20T14:29:40,834][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"7.6.1"}
[2020-03-20T14:29:43,344][INFO ][org.reflections.Reflections] Reflections took 60 ms to scan 1 urls, producing 20 keys and 40 values
[2020-03-20T14:29:44,572][INFO ][logstash.outputs.elasticsearch][main] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2020-03-20T14:29:44,870][WARN ][logstash.outputs.elasticsearch][main] Restored connection to ES instance {:url=>"http://localhost:9200/"}
[2020-03-20T14:29:45,045][INFO ][logstash.outputs.elasticsearch][main] ES Output version determined {:es_version=>7}
[2020-03-20T14:29:45,074][WARN ][logstash.outputs.elasticsearch][main] Detected a 6.x and above cluster: the typeevent field won't be used to determine the document _type {:es_version=>7}
[2020-03-20T14:29:45,253][INFO ][logstash.outputs.elasticsearch][main] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2020-03-20T14:29:45,488][WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge][main] A gauge metric of an unknown type (org.jruby.specialized.RubyArrayOneObject) has been create for key: cluster_uuids. This may result in invalid serialization. It is recommended to log an issue to the responsible developer/development team.
[2020-03-20T14:29:45,519][INFO ][logstash.javapipeline ][main] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>250, "pipeline.sources"=>["/home/siem/logstash-7.6.1/config/demo-metrics-pipeline.conf"], :thread=>"#<Thread:0x4c9770f2 run>"}
[2020-03-20T14:29:46,850][INFO ][logstash.inputs.beats ][main] Beats inputs: Starting input listener {:address=>"0.0.0.0:5044"}
[2020-03-20T14:29:46,875][INFO ][logstash.javapipeline ][main] Pipeline started {"pipeline.id"=>"main"}
[2020-03-20T14:29:47,026][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2020-03-20T14:29:47,172][INFO ][org.logstash.beats.Server][main] Starting server on port: 5044
[2020-03-20T14:29:47,593][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2020-03-20T14:34:11,941][INFO ][org.logstash.beats.BeatsHandler][main] [local: 0.0.0.0:5044, remote: 172.16.10.102:59127] Handling exception: Connection reset by peer
[2020-03-20T14:34:11,971][WARN ][io.netty.channel.DefaultChannelPipeline][main] An exceptionCaught() event was fired, and it reached at the tail of the pipeline.
It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) [netty-all-4.1.30.Final.jar:4.1.30.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.30.Final.jar:4.1.30.Final]
    at java.lang.Thread.run(Thread.java:834) [?:?]

It looks like a connection issue between Logstash and Elasticsearch, which means that Logstash could not reach Elasticsearch. Again, this could be garbage collection on Elasticsearch or a regular network issue, which is why it is so important to check both sides.

Also, does Logstash recover from this?
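A quick way to check the Elasticsearch side when this happens (assuming it runs locally on the default port) is the cluster health API:

    curl -X GET "localhost:9200/_cluster/health?pretty"

If that hangs or times out while the Beats connection is being reset, the problem is more likely on the Elasticsearch side.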

I tried posting the error as preformatted text, but it looks impossible to read?

Logstash does not recover from this. It feels like it breaks when I start/restart Winlogbeat, which is the only thing shipping data to Logstash at the moment.

Also, unrelated but out of curiosity, what's a non-screen device? :stuck_out_tongue:

OK, if it does not recover, that looks like a bug to me. Please open an issue in the Logstash repo and provide as much info as possible.

Before doing that, can you ensure that you can reach the Elasticsearch port from the Logstash machine, to make sure this is a process issue and not a system issue?
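A simple check (assuming Elasticsearch is on the default port 9200) could be:

    curl -v http://localhost:9200/

If that returns the usual cluster info JSON, the port is reachable and the problem is somewhere else.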

For your curiosity: non-screen == braille for example :slight_smile:

1 Like

The entire ELK stack is running on the same machine, and I have been able to get events all the way from Winlogbeat -> Logstash -> Elasticsearch -> Kibana.
It worked last Friday, but on Monday I could see that it had broken over the weekend.

Starting Elasticsearch, Logstash and Kibana now causes nothing to break, but I am not getting anything through. Then I go to the Windows PC and restart the Winlogbeat service, and then Logstash breaks with the connection reset by peer error, from which it never recovers.

Hmm. Running everything, it looks like it is still not REALLY listening on port 9200?
On IPv4 at least.

It looks like Logstash is able to connect to it, though?
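This is roughly how I'm checking which addresses it is listening on (using ss on the host running Elasticsearch), in case it matters:

    ss -tlnp | grep 9200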