Garbage Collection blackout

Hi all,

I have set up an ELK server (mostly using defaults) on a VM with 2GB of memory and 200GB of disk. It has been loading about 250 log files from another server using Filebeat. (The files are only about 300KB each.)
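For reference, the filebeat.yml is basically stock; it's roughly along these lines (the path and host below are placeholders, not my real values):

    filebeat.prospectors:
    - input_type: log
      paths:
        - /var/log/myapp/*.log        # placeholder path
    output.elasticsearch:
      hosts: ["my-elk-server:9200"]   # placeholder host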

Performance has been really terrible, though, and the Elasticsearch logs are showing numerous errors.

Also -- and the reason for this message -- I've been seeing extended periods where the ES server appears to almost completely black out while doing GC. Notice the logs below, which seem to indicate an 8.8-minute stretch of GC. (I've seen other occurrences with over 20 minutes of GC.)

Below are two images from X-Pack monitoring (for the same period). Notice the 8 minutes of nothingness.

This isn't healthy, right? Thoughts?

Thanks,
Greg

[2017-04-07T13:34:15,260][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5157] overhead, spent [1s] collecting in the last [1.1s]
[2017-04-07T13:34:17,931][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][young][5159][3899] duration [1.5s], collections [1]/[1.6s], total [1.5s]/[29.2m], memory [1.9gb]->[1.9gb]/[1.9gb], all_pools {[young] [62mb]->[514kb]/[66.5mb]}{[survivor] [8.3mb]->[8.3mb]/[8.3mb]}{[old] [1.9gb]->[1.9gb]/[1.9gb]}
[2017-04-07T13:34:17,931][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5159] overhead, spent [1.5s] collecting in the last [1.6s]
[2017-04-07T13:34:25,167][INFO ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5166] overhead, spent [498ms] collecting in the last [1.2s]
[2017-04-07T13:34:26,169][INFO ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5167] overhead, spent [377ms] collecting in the last [1s]
[2017-04-07T13:34:30,170][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5171] overhead, spent [590ms] collecting in the last [1s]

*** blackout for 8 minutes ***

[2017-04-07T13:43:24,673][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][old][5175][12] duration [8.8m], collections [2]/[8.8m], total [8.8m]/[11.1m], memory [1.9gb]->[679mb]/[1.9gb], all_pools {[young] [16.5mb]->[958.5kb]/[66.5mb]}{[survivor] [7mb]->[0b]/[8.3mb]}{[old] [1.9gb]->[678.8mb]/[1.9gb]}
[2017-04-07T13:43:24,765][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5175] overhead, spent [8.8m] collecting in the last [8.8m]
[2017-04-07T13:43:28,693][INFO ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][young][5178][3908] duration [888ms], collections [1]/[1.9s], total [888ms]/[29.3m], memory [718.1mb]->[687.5mb]/[1.9gb], all_pools {[young] [39.3mb]->[36.2kb]/[66.5mb]}{[survivor] [0b]->[8.3mb]/[8.3mb]}{[old] [678.8mb]->[679.1mb]/[1.9gb]}


Yeah, that isn't healthy. This is your smoking gun:

[2017-04-07T13:43:24,765][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5175] overhead, spent [8.8m] collecting in the last [8.8m]

A couple of things:

  1. If you have a 2GB heap on a machine with 2GB of RAM, then you aren't going to get any disk caching, so your performance is going to be terrible. In general, we recommend using no more than half of the RAM for the heap.
  2. That recommendation is tricky here, because you can barely run what you have with 2GB of heap, much less 1GB. It looks like you have 1,205 shards, which is quite a lot. Small indices should be created with a single shard. I'd try lowering that number first and then investigate further (see the sketch after this list).
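To make both points concrete, here's a minimal sketch, assuming a stock package install (the file path, template name, and index pattern are examples only, adjust to your setup):

    # /etc/elasticsearch/jvm.options -- keep the heap at no more than half the RAM,
    # e.g. 1g on the current 2GB VM:
    -Xms1g
    -Xmx1g

And an index template so that newly created small indices get a single primary shard (existing indices keep their current shard count unless they are reindexed or recreated):

    curl -XPUT 'localhost:9200/_template/single_shard_logs' -H 'Content-Type: application/json' -d '
    {
      "template": "filebeat-*",
      "settings": {
        "index.number_of_shards": 1
      }
    }'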

We've talked about adding a check on the number of shards a node can hold based on its memory, but we've never done it. Situations like yours are a good reason to have it, though.
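In the meantime, a quick way to eyeball where you stand is the standard _cat APIs, which report shards per node and heap pressure:

    curl 'localhost:9200/_cat/allocation?v'
    curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.max'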


Thanks Nic!

I increased the VM memory to 8GB and everything seems to be running smoothly now. Also, the disk space allocated to /var/lib/elasticsearch was running low, so I increased that too. I'm no longer seeing any errors in elasticsearch.log.

I appreciate the help!

Greg

