Massive Heap Usage (99%) on our "Warm" nodes

Our cluster is operating pretty well, but we've noticed that our warm data nodes are under intense JVM pressure. This could be self-inflicted by our mappings and data size, but I'm looking for a way to confirm that.

First, our cluster looks like this:

  • hot ingestion nodes: 40 x i3.2xlarge (8 vCPU, 61 GB RAM, 30 GB dedicated to ES)
  • warm nodes: 10 x d2.2xlarge (8 vCPU, 61 GB RAM, 30 GB dedicated to ES)
  • master nodes: 3 x r3.xlarge

We have three index patterns today: aws, syslog, and cloudtrail logs. Each index rolls over based on index size (a rough sketch of the call follows the list):

  • logs-aws-xxxxxx: 16 shards @ 1x replication, rolls at 512 GB
  • logs-syslog-xxxxxx: 32 shards @ 1x replication, rolls at 1024 GB
  • logs-cloudtrail-xxxxxx: 2 shards @ 1x replication, rolls at 100 GB
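
For illustration only, the roll is roughly equivalent to a size-based _rollover call like the following (a minimal sketch: it assumes ES 6.1+ for the max_size condition, and the write alias name and host are placeholders):

  # Minimal sketch of a size-based roll (assumes ES 6.1+; alias and host are placeholders).
  curl -s -X POST 'http://localhost:9200/logs-aws-write/_rollover' \
    -H 'Content-Type: application/json' \
    -d '{ "conditions": { "max_size": "512gb" } }'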

Finally, we ingest around 2 billion events a day: ~1.1B into the syslog index, ~900M into the aws index, and a few million into the cloudtrail indices.

On our hot ingestion nodes we see a pretty healthy memory pattern: plenty of up and down, but never really breaking ~80%. Here are all the hot nodes:

Here's one node singled out:

Unfortunately, our warm nodes are doing much worse:


(screenshot: warm-node JVM heap usage)
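
For anyone reading without the screenshots, the same per-node heap numbers can be pulled straight from the cluster; a quick sketch, with the host as a placeholder:

  # Per-node heap usage without the monitoring UI; node.role separates data nodes from masters.
  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max'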

Finally, here is the GC log on a single warm node:

{"timeMillis":1524514971934,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"INFO","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132234] overhead, spent [448ms] collecting in the last [1s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":321}}
{"timeMillis":1524514974147,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"INFO","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132236] overhead, spent [437ms] collecting in the last [1.2s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":321}}
{"timeMillis":1524514975318,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"WARN","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132237] overhead, spent [588ms] collecting in the last [1.1s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":316}}
{"timeMillis":1524514978319,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"WARN","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132240] overhead, spent [580ms] collecting in the last [1s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":316}}
{"timeMillis":1524514979656,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"WARN","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132241] overhead, spent [766ms] collecting in the last [1.3s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":316}}
{"timeMillis":1524514982657,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"WARN","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132244] overhead, spent [669ms] collecting in the last [1s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":316}}
{"timeMillis":1524514984012,"thread":"elasticsearch[XXX][scheduler][T#1]","level":"WARN","loggerName":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","marker":{"name":"[XXX] "},"message":"[gc][132245] overhead, spent [793ms] collecting in the last [1.3s]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","threadId":21,"threadPriority":5,"source":{"class":"org.elasticsearch.monitor.jvm.JvmGcMonitorService","method":"logGcOverhead","file":"JvmGcMonitorService.java","line":316}}

It looks to me like the issue has to do with our data patterns: we index, analyze, and store keyword fields for basically everything. Looking at some of the metrics, it looks like we're using ~120 GB or so (across 10 nodes, right?) for terms and related structures:

The same metric on the hot nodes (40 of them) is far lower.

So obviously we are storing less data on each hot node and prioritizing CPU/IO for them, but on the warm nodes something strange seems to be going on. We have 10 nodes, so that's ~300 GB of heap in total. Only ~40% of that is accounted for in these graphs, so what's using the other ~60%?
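
For a per-node view of the same accounting (segment memory, fielddata, and the two caches), something like this works; again a sketch, with the host as a placeholder:

  # Rough per-node breakdown of the heap consumers Elasticsearch reports.
  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,segments.memory,fielddata.memory_size,query_cache.memory_size,request_cache.memory_size'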

Are there things we can do to tune the warm nodes differently? Can we let them lean more heavily on the filesystem cache and read from disk when necessary, keeping fewer terms and similar structures in memory?
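
For concreteness, one knob in that direction would be force-merging indices once they've rolled to warm, since fewer segments generally means less per-segment heap overhead (a sketch only; the index name and host are placeholders):

  # Force-merge a rolled-over index down to one segment per shard (index name is a placeholder).
  curl -s -X POST 'http://localhost:9200/logs-aws-000123/_forcemerge?max_num_segments=1'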

We're using Oracle Java 1.8.0_152, and here are our JVM settings:

 -Dfile.encoding=UTF-8
 -Dio.netty.noKeySetOptimization=true
 -Dio.netty.noUnsafe=true 
 -Dio.netty.recycler.maxCapacityPerThread=0 
 -Djava.awt.headless=true 
 -Djna.nosys=true 
 -Dlog4j.shutdownHookEnabled=false 
 -Dlog4j2.disable.jmx=true 
 -XX:+AlwaysPreTouch
 -XX:+HeapDumpOnOutOfMemoryError
 -XX:+PrintGCApplicationStoppedTime
 -XX:+PrintGCDateStamps
 -XX:+PrintGCDetails
 -XX:+PrintTenuringDistribution
 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+UseConcMarkSweepGC
 -XX:+UseGCLogFileRotation
 -XX:-OmitStackTraceInFastThrow
 -XX:CMSInitiatingOccupancyFraction=75
 -XX:GCLogFileSize=64m
 -XX:MaxNewSize=19046400000
 -XX:NumberOfGCLogFiles=32 
 -Xloggc:/mnt/log/elasticsearch/xxx/gc.log 
 -Xms31744m 
 -Xmx31744m 
 -Xss1m -server
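
With -Xms/-Xmx at 31744m we should still be under the compressed-oops cutoff, but it's worth confirming; a quick sketch, run with the same heap size:

  # Confirm compressed oops are still in effect at this heap size (~31 GB).
  # (Elasticsearch also reports this at startup as "compressed ordinary object pointers [true]".)
  java -Xms31744m -Xmx31744m -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i UseCompressedOops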

Here's a node-stats dump from one node that's using around 89% of the heap:

"nodes": {
    "RLYXvghXR4qGYzxFfhS4nA": {
...
      "indices": {
        "docs": {
          "count": 2538073880,
          "deleted": 679959
        },
        "store": {
          "size_in_bytes": 4127888213380
        },
        "indexing": {
          "index_total": 15,
          "index_time_in_millis": 1502,
          "index_current": 0,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 0
        },
        "get": {
          "total": 297,
          "time_in_millis": 570,
          "exists_total": 297,
          "exists_time_in_millis": 570,
          "missing_total": 0,
          "missing_time_in_millis": 0,
          "current": 0
        },
        "search": {
          "open_contexts": 0,
          "query_total": 251,
          "query_time_in_millis": 951222,
          "query_current": 0,
          "fetch_total": 173,
          "fetch_time_in_millis": 135,
          "fetch_current": 0,
          "scroll_total": 0,
          "scroll_time_in_millis": 0,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        "merges": {
          "current": 1,
          "current_docs": 17320811,
          "current_size_in_bytes": 37535457027,
          "total": 23,
          "total_time_in_millis": 42104198,
          "total_docs": 338173141,
          "total_size_in_bytes": 510221823653,
          "total_stopped_time_in_millis": 0,
          "total_throttled_time_in_millis": 0,
          "total_auto_throttle_in_bytes": 4345957545
        },
        "refresh": {
          "total": 997,
          "total_time_in_millis": 38353,
          "listeners": 0
        },
        "flush": {
          "total": 136,
          "total_time_in_millis": 91251
        },
        "warmer": {
          "current": 0,
          "total": 218,
          "total_time_in_millis": 3
        },
        "query_cache": {
          "memory_size_in_bytes": 0,
          "total_count": 0,
          "hit_count": 0,
          "miss_count": 0,
          "cache_size": 0,
          "cache_count": 0,
          "evictions": 0
        },
        "fielddata": {
          "memory_size_in_bytes": 688,
          "evictions": 0
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 1153,
          "memory_in_bytes": 12451563253,
          "terms_memory_in_bytes": 11811108456,
          "stored_fields_memory_in_bytes": 534591968,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 6424448,
          "points_memory_in_bytes": 93260361,
          "doc_values_memory_in_bytes": 6178020,
          "index_writer_memory_in_bytes": 0,
          "version_map_memory_in_bytes": 0,
          "fixed_bit_set_memory_in_bytes": 0,
          "max_unsafe_auto_id_timestamp": 1524417302501,
          "file_sizes": {

          }
        },
        "translog": {
          "operations": 75406,
          "size_in_bytes": 143709315,
          "uncommitted_operations": 75392,
          "uncommitted_size_in_bytes": 139064017
        },
        "request_cache": {
          "memory_size_in_bytes": 4881,
          "evictions": 0,
          "hit_count": 173,
          "miss_count": 11
        },
        "recovery": {
          "current_as_source": 2,
          "current_as_target": 0,
          "throttle_time_in_millis": 111946628
        }
      },
      "jvm": {
        "timestamp": 1524517719653,
        "uptime_in_millis": 145298191,
        "mem": {
          "heap_used_in_bytes": 27980965256,
          "heap_used_percent": 89,
          "heap_committed_in_bytes": 31381389312,
          "heap_max_in_bytes": 31381389312,
          "non_heap_used_in_bytes": 162423328,
          "non_heap_committed_in_bytes": 173662208,
          "pools": {
            "young": {
              "used_in_bytes": 14684734504,
              "max_in_bytes": 15237185536,
              "peak_used_in_bytes": 15237185536,
              "peak_max_in_bytes": 15237185536
            },
            "survivor": {
              "used_in_bytes": 65498768,
              "max_in_bytes": 1904607232,
              "peak_used_in_bytes": 1904607232,
              "peak_max_in_bytes": 1904607232
            },
            "old": {
              "used_in_bytes": 13230731984,
              "max_in_bytes": 14239596544,
              "peak_used_in_bytes": 14239596544,
              "peak_max_in_bytes": 14239596544
            }
          }
        },
        "threads": {
          "count": 146,
          "peak_count": 150
        },
        "gc": {
          "collectors": {
            "young": {
              "collection_count": 910,
              "collection_time_in_millis": 157719
            },
            "old": {
              "collection_count": 12792,
              "collection_time_in_millis": 23209724
            }
          }
        },
        "buffer_pools": {
          "direct": {
            "count": 125,
            "used_in_bytes": 272173968,
            "total_capacity_in_bytes": 272173967
          },
          "mapped": {
            "count": 5821,
            "used_in_bytes": 4110936989431,
            "total_capacity_in_bytes": 4110936989431
          }
        },
        "classes": {
          "current_loaded_count": 17469,
          "total_loaded_count": 17704,
          "total_unloaded_count": 235
        }
      },
      "breakers": {
        "request": {
          "limit_size_in_bytes": 18828833587,
          "limit_size": "17.5gb",
          "estimated_size_in_bytes": 0,
          "estimated_size": "0b",
          "overhead": 1,
          "tripped": 0
        },
        "fielddata": {
          "limit_size_in_bytes": 26674180915,
          "limit_size": "24.8gb",
          "estimated_size_in_bytes": 688,
          "estimated_size": "688b",
          "overhead": 1.03,
          "tripped": 0
        },
        "in_flight_requests": {
          "limit_size_in_bytes": 31381389312,
          "limit_size": "29.2gb",
          "estimated_size_in_bytes": 527230,
          "estimated_size": "514.8kb",
          "overhead": 1,
          "tripped": 0
        },
        "accounting": {
          "limit_size_in_bytes": 31381389312,
          "limit_size": "29.2gb",
          "estimated_size_in_bytes": 12451563253,
          "estimated_size": "11.5gb",
          "overhead": 1,
          "tripped": 0
        },
        "parent": {
          "limit_size_in_bytes": 21966972518,
          "limit_size": "20.4gb",
          "estimated_size_in_bytes": 12452091171,
          "estimated_size": "11.5gb",
          "overhead": 1,
          "tripped": 0
        }
      }
    }
  }
