Frequent GC in Elasticsearch

Hi, we are running a 4-node ES cluster. Sometimes the nodes go into frequent GC and Elasticsearch starts crash looping for a while; it seems to be crash looping because of OOM. This happens randomly and fairly rarely.

  1. Is there a way to confirm that ES is crashing because of OOM? There is no OOM error message in the ES logs as such.
  2. Is there something we can monitor, or more logging we can enable, for this? (For example, would the options sketched below help?)
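
For example (this is only a sketch - the paths are placeholders, and depending on the ES version the variable may be JAVA_OPTS rather than ES_JAVA_OPTS):

# Placeholder paths - adjust for your install; picked up by bin/elasticsearch on startup
export ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/elasticsearch/ \
  -Xloggc:/var/log/elasticsearch/gc.log"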

Adding some stats below. Please note that these stats were taken under normal conditions, since we do not know when the issue will happen.
ES_HEAP_SIZE is set to 1G.
In top, the virtual memory used by the ES process is 15G, while resident memory looks fine. How can we check why the virtual memory is so high?
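
For example, would listing the largest mappings of the process, roughly as below, be the right way to look? (<es_pid> is just a placeholder for the Elasticsearch process id.)

# Largest memory mappings of the ES process, by virtual size (Kbytes column)
pmap -x <es_pid> | sort -k2 -n | tail -15
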
Thanks.

Output of "_nodes/stats" API on one of the nodes:

"6LlL-senRSSPPzRF-uWzfw" : {
  "indices" : {
"docs" : {
  "count" : 18760592,
  "deleted" : 7174232
},
"store" : {
  "size_in_bytes" : 4074668631,
  "throttle_time_in_millis" : 0
},
"indexing" : {
  "index_total" : 5254,
  "index_time_in_millis" : 1613096,
  "index_current" : 0,
  "delete_total" : 5011,
  "delete_time_in_millis" : 1241551,
  "delete_current" : 0,
  "noop_update_total" : 0,
  "is_throttled" : false,
  "throttle_time_in_millis" : 0
},
"merges" : {
  "current" : 0,
  "current_docs" : 0,
  "current_size_in_bytes" : 0,
  "total" : 54,
  "total_time_in_millis" : 64200,
  "total_docs" : 305024,
  "total_size_in_bytes" : 61877981,
  "total_stopped_time_in_millis" : 0,
  "total_throttled_time_in_millis" : 0,
  "total_auto_throttle_in_bytes" : 314572800
},
"refresh" : {
  "total" : 1857,
  "total_time_in_millis" : 447633
},
"flush" : {
  "total" : 417,
  "total_time_in_millis" : 557973
},
"fielddata" : {
  "memory_size_in_bytes" : 0,
  "evictions" : 0
},
"percolate" : {
  "total" : 0,
  "time_in_millis" : 0,
  "current" : 0,
  "memory_size_in_bytes" : -1,
  "memory_size" : "-1b",
  "queries" : 0
},
"completion" : {
  "size_in_bytes" : 0
},
"segments" : {
  "count" : 202,
  "memory_in_bytes" : 19379660,
  "terms_memory_in_bytes" : 16196400,
  "stored_fields_memory_in_bytes" : 3156496,
  "term_vectors_memory_in_bytes" : 0,
  "norms_memory_in_bytes" : 8160,
  "doc_values_memory_in_bytes" : 18604,
  "index_writer_memory_in_bytes" : 0,
  "index_writer_max_memory_in_bytes" : 7680000,
  "version_map_memory_in_bytes" : 0,
  "fixed_bit_set_memory_in_bytes" : 0
},
  "os" : {
"timestamp" : 1470821759940,
"load_average" : 41.86,
"mem" : {
  "total_in_bytes" : 67278057472,
  "free_in_bytes" : 1224552448,
  "used_in_bytes" : 66053505024,
  "free_percent" : 2,
  "used_percent" : 98
},
"swap" : {
  "total_in_bytes" : 0,
  "free_in_bytes" : 0,
  "used_in_bytes" : 0
}
  },
  "process" : {
"timestamp" : 1470821759941,
"open_file_descriptors" : 871,
"max_file_descriptors" : 8192,
"cpu" : {
  "percent" : 0,
  "total_in_millis" : 568010
},
"mem" : {
  "total_virtual_in_bytes" : 16815521792
}
  },
  "jvm" : {
"timestamp" : 1470821759944,
"uptime_in_millis" : 36490754,
"mem" : {
  "heap_used_in_bytes" : 383695600,
  "heap_used_percent" : 36,
  "heap_committed_in_bytes" : 1037959168,
  "heap_max_in_bytes" : 1037959168,
  "non_heap_used_in_bytes" : 79842504,
  "non_heap_committed_in_bytes" : 81248256,
  "pools" : {
    "young" : {
      "used_in_bytes" : 284070776,
      "max_in_bytes" : 286326784,
      "peak_used_in_bytes" : 286326784,
      "peak_max_in_bytes" : 286326784
    },
    "survivor" : {
      "used_in_bytes" : 17477200,
      "max_in_bytes" : 35782656,
      "peak_used_in_bytes" : 29910688,
      "peak_max_in_bytes" : 35782656
    },
    "old" : {
      "used_in_bytes" : 82147624,
      "max_in_bytes" : 715849728,
      "peak_used_in_bytes" : 82147624,
      "peak_max_in_bytes" : 715849728
    }
  }
},
"threads" : {
  "count" : 227,
  "peak_count" : 237
},
"gc" : {
  "collectors" : {
    "young" : {
      "collection_count" : 35,
      "collection_time_in_millis" : 1008
    },
    "old" : {
      "collection_count" : 1,
      "collection_time_in_millis" : 98
    }
  }
},
"buffer_pools" : {
  "direct" : {
    "count" : 451,
    "used_in_bytes" : 62089757,
    "total_capacity_in_bytes" : 62089757
  },
  "mapped" : {
    "count" : 52,
    "used_in_bytes" : 1116108764,
    "total_capacity_in_bytes" : 1116108764
  }
}
  },

If it's OOMing then it'll definitely log something to that effect.
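
For example, something along these lines should turn it up if the JVM itself ran out of heap (the log path is just an example, adjust it for your install):

# Look for JVM out-of-memory errors in the ES logs
grep -i "OutOfMemoryError" /var/log/elasticsearch/*.log
# Also check for heap dumps or JVM fatal error files near the logs / working directory
ls -l /var/log/elasticsearch/*.hprof hs_err_pid*.log 2>/dev/null
# If nothing shows there, the kernel OOM killer is worth ruling out as well
dmesg | grep -i "killed process"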

How do you know it is GC? Are you using Marvel?

We are running ES with -XX:+PrintGCDateStamps -XX:+PrintGCDetails, so I looked at the GC timestamps. Before crashing it dumps the heap summary, but there is no "OOM" or "Out of memory" kind of error in the log files.
We are not using Marvel. Pasting a snippet below:

2016-08-01T14:19:51.022-0700: [GC (Allocation Failure) [ParNew: 292901K->10192K(314560K), 0.0082036 secs] 802996K->520568K(1013632K), 0.0083479 secs] [Times: user=0.04 sys=0.02, real=0.00 secs]
2016-08-01T14:20:16.960-0700: [GC (Allocation Failure) [ParNew: 289808K->10292K(314560K), 0.0039105 secs] 800184K->520936K(1013632K), 0.0040721 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2016-08-01T14:20:44.726-0700: [GC (Allocation Failure) [ParNew: 289908K->7794K(314560K), 0.0090382 secs] 800552K->518767K(1013632K), 0.0092109 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
2016-08-01T14:20:59.964-0700: [GC (Allocation Failure) [ParNew: 287410K->11938K(314560K), 0.0044746 secs] 798383K->523416K(1013632K), 0.0046101 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2016-08-01T14:21:24.981-0700: [GC (Allocation Failure) [ParNew: 291554K->8420K(314560K), 0.0085039 secs] 803032K->520439K(1013632K), 0.0086337 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
2016-08-01T14:22:14.829-0700: [GC (Allocation Failure) [ParNew: 288036K->8933K(314560K), 0.0067369 secs] 800055K->521143K(1013632K), 0.0068921 secs] [Times: user=0.04 sys=0.03, real=0.01 secs]
Heap
 par new generation   total 314560K, used 259804K [0x00000000c0000000, 0x00000000d5550000, 0x00000000d5550000)
  eden space 279616K,  89% used [0x00000000c0000000, 0x00000000cf4fdbf0, 0x00000000d1110000)
  from space 34944K,  25% used [0x00000000d3330000, 0x00000000d3be94b8, 0x00000000d5550000)
  to   space 34944K,   0% used [0x00000000d1110000, 0x00000000d1110000, 0x00000000d3330000)
 concurrent mark-sweep generation total 699072K, used 512210K [0x00000000d5550000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 41021K, capacity 41276K, committed 41836K, reserved 1085440K
  class space    used 4828K, capacity 4907K, committed 5036K, reserved 1048576K
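
To get a rough feel for how much time these pauses actually add up to, we can total the "real" pause times from the same GC log with something like this (the gc.log path is a placeholder):

# Sum the real pause times reported on each GC line
grep -o 'real=[0-9.]*' gc.log | cut -d= -f2 \
  | awk '{ sum += $1; n++ } END { printf "%d pauses, %.2f s total real pause time\n", n, sum }'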

Hi, can someone help me here? We are seeing this issue again on another cluster. It was crashing with a heap size of 1G; we kept increasing the heap and it stopped crashing at ES_HEAP_SIZE=4G. Currently the heap usage shown by the _nodes/stats/jvm API is around 1.7G, but running "top" on the ES process shows 4G of resident memory, and this keeps increasing. On a fresh start the resident memory is 3G and it keeps growing. So if the heap used is 1.7G, what is the rest of the memory being used for?
Is there a way to estimate the memory required by ES based on parameters such as the number of docs?
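
To see where some of the non-heap memory is going, we can at least pull the off-heap pieces the JVM itself reports (committed heap, non-heap, and the direct and mapped buffer pools) with something like this (assuming the node listens on localhost:9200):

# Committed heap and non-heap usage
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_committed_in_bytes|non_heap_used_in_bytes'
# Direct and mapped (mmapped files) buffer pools
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -A 3 -E '"direct"|"mapped"'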

Periodic output of top for ES process:
42692 42688 xxxxx 20 0 23.182g 4.088g 936948 S 0.7 6.5 72:35.77 java
42692 42688 xxxxx 20 0 23.182g 4.088g 936948 S 6.3 6.5 72:35.96 java
42692 42688 xxxxx 20 0 23.182g 4.089g 936948 S 1.0 6.5 74:10.61 java
42692 42688 xxxxx 20 0 23.182g 4.089g 936948 S 1.3 6.5 74:10.65 java

Output of the _nodes/stats/jvm API:
"jvm" : {
"timestamp" : 1471754894560,
"uptime_in_millis" : 40150339,
"mem" : {
"heap_used_in_bytes" : 1818970672,
"heap_used_percent" : 43,
"heap_committed_in_bytes" : 4151836672,
"heap_max_in_bytes" : 4151836672,
"non_heap_used_in_bytes" : 83648032,
"non_heap_committed_in_bytes" : 84856832,
"pools" : {
"young" : {
"used_in_bytes" : 312146792,
"max_in_bytes" : 1145372672,
"peak_used_in_bytes" : 1145372672,
"peak_max_in_bytes" : 1145372672
},
"survivor" : {
"used_in_bytes" : 33560392,
"max_in_bytes" : 143130624,
"peak_used_in_bytes" : 116762600,
"peak_max_in_bytes" : 143130624
},
"old" : {
"used_in_bytes" : 1473267064,
"max_in_bytes" : 2863333376,
"peak_used_in_bytes" : 1473267064,
"peak_max_in_bytes" : 2863333376
}
}
},

Resident memory is controlled by the OS, not ES.

I think the question here is whether ES is using memory other than the heap - e.g. mmapping files and leaking those handles, or something like that.
Also, what is the expected minimum amount of memory per node to manage 5.6GB of index data? Surely it can't be 4G.

It is not; ES only uses the heap.
The OS does cache all sorts of other data that ES may use, but that is the OS, not ES.

See Heap: Sizing and Swapping | Elasticsearch: The Definitive Guide [2.x] | Elastic
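
You can also see this on the node itself; for example:

free -m                                   # the "cached"/"buff/cache" column is the OS page cache, reclaimed automatically under memory pressure
grep -E '^Cached|^MemFree' /proc/meminfo  # same numbers straight from the kernel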

Ok, then the question is: how do we tell Lucene not to cache too much data in memory?
[AFAIK, the buffer cache is separate - so this must be something that Lucene is doing.]

It's not Lucene, and it's not ES.

It's the OS, caching files that ES (via Lucene) is asking for.

This sort of memory use would happen with any system that constantly asks for files. The OS caches the ones that are commonly requested so that response times are minimal.

It is not application memory use.
It is not a memory leak.
It will not be causing your GC.

What sort of data is it? What are your mappings?
What sort of queries are you running?
Do you have Marvel installed? If not, then why not?
