Heap grows as GC slows down until clients disconnect

Hi,

We have a test setup running ES 2.4.0 on a single m4.large. It used to work fine with the Nutch crawler indexing web pages into a content_store index and another worker reading those pages and indexing them into a search index, but we switched to StormCrawler, which indexes much faster, and now we're running into problems. I split the search index off to a separate server and upgraded the m4.large to an m4.xlarge with 16 GB of RAM, but that only postpones the problem. We have three clients reading from and writing to the main content_store index and a few other indices, and they're streaming data constantly.

The heap keeps growing until it hits 99% and the GC starts taking over 30s, so all our clients start timing out with:
[#|2016-12-07T05:22:03.653Z|INFO |elasticsearch[Paralyzer][generic][T#187]|o.e.c.transport |[Paralyzer] failed to get node info for {#transport#-1}{x.x.x.x}{x.x.x.x:9300}, disconnecting...|Log4jESLogger.java:125|#]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][172.31.38.111:9300][cluster:monitor/nodes/liveness] request_id [545363] timed out after [5000ms]

And the elasticsearch log shows:

[2016-12-07 06:00:36,981][WARN ][monitor.jvm              ] [Loa] [gc][old][38612][2214] duration [28s], collections [1]/[28.9s], total [28s]/[6.2h], memory [7.9gb]->[7.8gb]/[7.9gb], all_pools {[young] [247.3mb]->[206.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [7.6gb]->[7.6gb]/[7.6gb]}
[2016-12-07 05:59:45,617][WARN ][index.indexing.slowlog.index] [Loa][content_store_v3][1] took[29.8s], took_millis[29816],
[2016-12-07 06:00:36,980][WARN ][index.search.slowlog.query] [Loa] [site_configs_v2][1] took[28s], took_millis[28015], types[siteConfig], stats[], search_type[SCAN], total_shards[5], source[{"size":10,"query":{"match_all":{}},"_source":{"includes":[],"excludes":[]}}], extra_source[], 
[2016-12-07 06:03:05,321][WARN ][index.search.slowlog.fetch] [Loa] [content_store_v3][2] took[25.4s], took_millis[25465], types[contentStoreItem], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[{"from":0,"size":500,"post_filter":{"terms":{"reIndex":["true"]}},"sort":[{"lastModified":{"order":"asc"}}]}], extra_source[],

We don't hit an OOM error, but the node just slows down until it's unusable, especially since the GC pauses are longer than the client timeout of 5s.

What would be the best plan of attack here?

  1. Are there config options I've missed that could help?
  2. Is there a way to throttle clients automatically instead of having them time out with exceptions? (Rough sketch of what I mean just below this list.)
  3. Should I reduce the default 5 shards and 1 replica to a single shard?
  4. Throw more hardware at the problem?
  5. Change to a multi-node cluster?
  6. Use doc values for the HTML content instead of "rawHtml": { "type": "string", "index": "no" }?
  7. Use something other than ES for the content store?
  8. ?
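
To make question 2 concrete, this is roughly the kind of client-side back-off I have in mind (a minimal Python sketch; our real clients aren't necessarily Python, and the 30s timeout, batch handling and retry numbers are made up):

```python
# Wrap each bulk batch in a retry loop that backs off when ES is too slow to answer,
# instead of letting the request time out and blow up with an exception.
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionTimeout
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"], timeout=30)

def index_with_backoff(actions, max_retries=5):
    """Send one bulk batch, sleeping progressively longer while ES is struggling."""
    for attempt in range(max_retries):
        try:
            return bulk(es, actions)
        except ConnectionTimeout:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            print("ES timed out, backing off for %ds" % wait)
            time.sleep(wait)
    raise RuntimeError("giving up after %d retries" % max_retries)
```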

Thanks for any pointers and guidance!

Cheers,
Vaughn

Do you have Monitoring installed?

I do now

Marvel is showing the heap growing and growing. I'm waiting for it to run out and checking the GC logs in the meantime.

2016-12-07T08:46:44.744+0000: 621.873: Total time for which application threads were stopped: 0.0388704 seconds, Stopping threads took: 0.0000866 seconds
2016-12-07T08:46:45.746+0000: 622.874: Total time for which application threads were stopped: 0.0002759 seconds, Stopping threads took: 0.0000539 seconds
2016-12-07T08:46:45.875+0000: 623.004: Total time for which application threads were stopped: 0.0002376 seconds, Stopping threads took: 0.0000549 seconds
2016-12-07T08:46:48.876+0000: 626.004: Total time for which application threads were stopped: 0.0003122 seconds, Stopping threads took: 0.0000767 seconds
2016-12-07T08:46:49.877+0000: 627.005: Total time for which application threads were stopped: 0.0002179 seconds, Stopping threads took: 0.0000553 seconds
2016-12-07T08:46:50.562+0000: 627.691: [GC (Allocation Failure) 2016-12-07T08:46:50.562+0000: 627.691: [ParNew
Desired survivor size 17432576 bytes, new threshold 3 (max 6)
- age   1:   12723128 bytes,   12723128 total
- age   2:    3177200 bytes,   15900328 total
- age   3:    1767192 bytes,   17667520 total
- age   4:    1026048 bytes,   18693568 total
- age   5:    2219520 bytes,   20913088 total
- age   6:    2915024 bytes,   23828112 total
: 293659K->30396K(306688K), 0.0432564 secs] 859520K->597682K(8354560K), 0.0433556 secs] [Times: user=0.17 sys=0.00, real=0.05 secs] 

It's odd that the times reported here don't match the /_nodes/stats/jvm?human times:

   "gc": {
      "collectors": {
        "young": {
          "collection_count": 143,
          "collection_time": "6.2s",
          "collection_time_in_millis": 6242
        },
        "old": {
          "collection_count": 1,
          "collection_time": "37ms",
          "collection_time_in_millis": 37
        }
      }
    },

Ahhh. Collection times are cumulative...
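
In case it helps anyone else lining the two up, this is the kind of diffing I ended up doing (a quick Python sketch against a single local test node; the 30s polling interval is arbitrary):

```python
# /_nodes/stats/jvm GC times are cumulative since node start, so to compare them with
# the GC log you have to diff consecutive samples.
import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # single local test node

def gc_totals():
    """Return {node_id: {collector_name: cumulative_collection_millis}}."""
    stats = es.nodes.stats(metric="jvm")
    return {
        node_id: {
            name: coll["collection_time_in_millis"]
            for name, coll in node["jvm"]["gc"]["collectors"].items()
        }
        for node_id, node in stats["nodes"].items()
    }

previous = gc_totals()
while True:
    time.sleep(30)
    current = gc_totals()
    for node_id, collectors in current.items():
        for name, total in collectors.items():
            delta = total - previous.get(node_id, {}).get(name, 0)
            print("%s %s: +%d ms of GC in the last 30s" % (node_id, name, delta))
    previous = current
```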

I upgraded to ES 2.4.2 and enabled GC logging, and now the heap size finally shrinks sometimes. Possibly a bug in 2.4.0 that blocked GC, but I can't see anything in the release notes for it.

No joy. With enough load it eventually grew its heap to 95-99% and started dropping clients due to long GC pauses.

Then you may need to look at adding more nodes.

Got to the bottom of this with heap dumps and by interrogating my developers.

It turned out that new code running against our test ES was building a Groovy script by concatenating parameter values into the script source, so ES was compiling every request as a unique script, and due to bugs in Groovy and the JVM the compiled classes were never garbage collected... Switching to passed-in script params made the problem go away. Just in time for the weekend! :smiley:
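
For anyone who finds this later, the difference was essentially the following (a minimal Python sketch; the counter field and the update it performs are invented stand-ins, not our real scripts):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumption: a local test node

# BAD: concatenating the value into the script source. Every distinct value yields a
# brand-new script string, so ES compiles (and caches) a new Groovy script per request.
def bump_counter_bad(doc_id, amount):
    es.update(
        index="content_store_v3",
        doc_type="contentStoreItem",
        id=doc_id,
        body={"script": {
            "inline": "ctx._source.counter += " + str(amount),  # unique source each call
            "lang": "groovy",
        }},
    )

# GOOD: one constant script source with the value passed as a param. ES compiles it
# once and reuses the cached compiled script for every request.
def bump_counter_good(doc_id, amount):
    es.update(
        index="content_store_v3",
        doc_type="contentStoreItem",
        id=doc_id,
        body={"script": {
            "inline": "ctx._source.counter += amount",
            "lang": "groovy",
            "params": {"amount": amount},
        }},
    )
```

With params the script source never changes, so the script cache ends up holding one compiled entry instead of one per distinct value.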

