Client Nodes being oom-killed


#1

Hi,
I'm running Elasticsearch 5.6.3 and am seeing my client nodes oom-killed periodically. This is odd because the heap size is set to 4GB while the server itself has 8GB of RAM, and nothing else is running on the node aside from Packetbeat.

Can someone help me get more info on these nodes' on- and off-heap memory usage so I can understand why oom-kill is kicking in?

Regards,
D


(Mark Walkom) #2

If you are not running Monitoring (from X-Pack) then you should install that to see what is happening.


#3

Is that available in the free version?


(Mark Walkom) #4

Yep, it's free!


#5

There's likely to be a lead time until I could implement that. Is there anything I can do in the meantime?


(Christian Dahlqvist) #6

If you can provide us with the output of the cluster stats API we would have a better view of the cluster and what is using heap.
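That's a single request, e.g. from the Kibana Console (or the equivalent curl against any node):

```
GET _cluster/stats?human&pretty
```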


#7

Cluster stats as requested:

{
  "_nodes" : {
    "total" : 15,
    "successful" : 15,
    "failed" : 0
  },
  "cluster_name" : "elk_prd",
  "timestamp" : 1527165795887,
  "status" : "green",
  "indices" : {
    "count" : 39,
    "shards" : {
      "total" : 742,
      "primaries" : 371,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 20,
          "avg" : 19.025641025641026
        },
        "primaries" : {
          "min" : 1,
          "max" : 10,
          "avg" : 9.512820512820513
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 5868458697,
      "deleted" : 7
    },
    "store" : {
      "size_in_bytes" : 6724892904181,
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 1015712,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 16851845878,
      "total_count" : 785549301,
      "hit_count" : 46445247,
      "miss_count" : 739104054,
      "cache_size" : 110755,
      "cache_count" : 511818,
      "evictions" : 401063
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 16516,
      "memory_in_bytes" : 11085201082,
      "terms_memory_in_bytes" : 9108751709,
      "stored_fields_memory_in_bytes" : 1370464368,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 725696,
      "points_memory_in_bytes" : 196900069,
      "doc_values_memory_in_bytes" : 408359240,
      "index_writer_memory_in_bytes" : 718227640,
      "version_map_memory_in_bytes" : 22902486,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 9223372036854775807,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 15,
      "data" : 10,
      "coordinating_only" : 1,
      "master" : 3,
      "ingest" : 2
    },
    "versions" : [
      "5.6.3"
    ],
    "os" : {
      "available_processors" : 47,
      "allocated_processors" : 47,
      "names" : [
        {
          "name" : "Linux",
          "count" : 15
        }
      ],
      "mem" : {
        "total_in_bytes" : 366583300096,
        "free_in_bytes" : 12803936256,
        "used_in_bytes" : 353779363840,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 395
      },
      "open_file_descriptors" : {
        "min" : 462,
        "max" : 720,
        "avg" : 633
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 3275809847,
      "versions" : [
        {
          "version" : "1.8.0_45",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.45-b02",
          "vm_vendor" : "Oracle Corporation",
          "count" : 15
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 97341295728,
        "heap_max_in_bytes" : 186421411840
      },
      "threads" : 907
    },
    "fs" : {
      "total_in_bytes" : 14962766397440,
      "free_in_bytes" : 8197970984960,
      "available_in_bytes" : 8040249016320,
      "spins" : "true"
    },
    "plugins" : [
      {
        "name" : "discovery-ec2",
        "version" : "5.6.3",
        "description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname" : "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "repository-s3",
        "version" : "5.6.3",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "mapper-size",
        "version" : "5.6.3",
        "description" : "The Mapper Size plugin allows document to record their uncompressed size at index time.",
        "classname" : "org.elasticsearch.plugin.mapper.MapperSizePlugin",
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 15
      },
      "http_types" : {
        "netty4" : 15
      }
    }
  }
}
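For what it's worth, a quick back-of-the-envelope check on the JVM figures above suggests the cluster-wide heap is only about half used, which is why I suspect the pressure is off-heap:

```python
# Figures copied from nodes.jvm.mem in the cluster stats output above.
heap_used = 97_341_295_728    # heap_used_in_bytes
heap_max = 186_421_411_840    # heap_max_in_bytes

used_pct = 100 * heap_used / heap_max
print(f"cluster-wide heap usage: {used_pct:.1f}%")  # ~52.2%
```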

Are there any logging settings I can enable on the client nodes to highlight behaviours/issues in advance of oom-kill dropping the axe?

Regards,
D


#8

@Christian_Dahlqvist see above :slight_smile:


(Christian Dahlqvist) #9

Do you have any non-standard settings on your nodes, e.g. cache sizes? What types of queries are you running?


#10

@Christian_Dahlqvist I have thread_pool.search.queue_size set to 1500 because the default search queue size was exceeded. The client nodes themselves are configured in quite a vanilla manner.
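For completeness, that's roughly the only non-default line in elasticsearch.yml on those nodes:

```yaml
# elasticsearch.yml on the client nodes
# (the 5.x default for the search queue is 1000)
thread_pool.search.queue_size: 1500
```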
I'm not intimately familiar with most of the queries hitting the cluster; it's predominantly Kibana traffic, I'd expect. However, we do have some scripted bulk searches going on, one of which has been reported as running very slowly in the last week or so (oom-kill has been happening since upgrading to 5.6.3 in April). An example of that query looks like this:

{'sort': [{'@timestamp': {'unmapped_type': 'date', 'order': 'asc'}}],
 'query': {'bool': {'filter': {'bool': defaultdict(<type 'list'>,
                        {'must': [{'range': {'@timestamp': {
                            'gte': datetime.datetime(2018, 5, 16, 0, 6, 32, 954000),
                            'lte': datetime.datetime(2018, 5, 16, 0, 7, 25, 652000)}}}]})},
                    'must': {'query_string': {'query': 'message: DiagnosticsEvent OR message: StatusEvent'}}}},
 'size': 10000}

This query gets run multiple times over different time windows to collect the log lines required.
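The windowing logic in the script is roughly this (simplified sketch; the window size and the times here are placeholders, and the actual search call is omitted):

```python
from datetime import datetime, timedelta

def time_windows(start, end, step):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    cur = start
    while cur < end:
        yield cur, min(cur + step, end)
        cur += step

# e.g. one hour of logs split into 5-minute query windows,
# each window queried with a body like the one shown above
windows = list(time_windows(datetime(2018, 5, 16, 0, 0),
                            datetime(2018, 5, 16, 1, 0),
                            timedelta(minutes=5)))
print(len(windows))  # 12
```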

Regards,
D


(Christian Dahlqvist) #11

The increased queue size will result in more heap being used. It is also hard for me to know how your queries affect heap usage. It looks like some of the aggregations have a large size parameter, that could result in a lot of buckets, driving heap usage.

It is possible that you may need to increase the amount of heap on these nodes. As dedicated coordinating-only nodes do not benefit from the file system page cache, you can increase heap to 75% of available RAM (maybe even a bit higher) on them. The same applies to any dedicated master nodes, although that does not seem to be the problem here.
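On an 8GB coordinating-only node that would mean something like this in jvm.options (6g is ~75% of RAM; adjust to taste):

```
# jvm.options on the coordinating-only node
-Xms6g
-Xmx6g
```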


#12

@Christian_Dahlqvist These nodes are not running out of heap; they're being oom-killed by the Linux kernel.
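For the record, this is how I'm reading the evidence out of the kernel log (dmesg / kern.log); the sample line below is made up but follows the usual oom-killer format:

```python
import re

# Typical oom-killer record from `dmesg` or /var/log/kern.log
# (sample values, made up for illustration):
rss_line = ("Killed process 12345 (java) "
            "total-vm:9181204kB, anon-rss:7512340kB, file-rss:0kB")

m = re.search(r"Killed process (\d+) \((\S+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB",
              rss_line)
pid, name = m.group(1), m.group(2)
vm_kb, rss_kb = int(m.group(3)), int(m.group(4))
print(f"{name} (pid {pid}): ~{rss_kb / 1024 / 1024:.1f}GB resident at kill time")
```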


#13

@Christian_Dahlqvist Here's a better example of the query I showed you above (as reported by the data node):

{
	"size": 10000,
	"query": {
		"bool": {
			"must": [{
				"query_string": {
					"query": "message: DiagnosticsEvent OR message: StatusEvent",
					"fields": [],
					"use_dis_max": true,
					"tie_breaker": 0.0,
					"default_operator": "or",
					"auto_generate_phrase_queries": false,
					"max_determinized_states": 10000,
					"enable_position_increments": true,
					"fuzziness": "AUTO",
					"fuzzy_prefix_length": 0,
					"fuzzy_max_expansions": 50,
					"phrase_slop": 0,
					"escape": false,
					"split_on_whitespace": true,
					"boost": 1.0
				}
			}],
			"filter": [{
				"bool": {
					"must": [{
						"range": {
							"@timestamp": {
								"from": "2018-05-17T06:45:32.452000",
								"to": "2018-05-17T06:48:50.719000",
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}],
					"disable_coord": false,
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}],
			"disable_coord": false,
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"sort": [{
		"@timestamp": {
			"order": "asc",
			"unmapped_type": "date"
		}
	}]
}

I just tried running this from the kibana console and got Error: RangeError: Maximum call stack size exceeded

EDIT: That seems like a browser-related issue. From Chrome that returns in 750ms!


#14

@Christian_Dahlqvist Any update? I'd appreciate help with troubleshooting/understanding why the kernel is oom-killing the java process. It's worth noting, by the way, that oom-kill was never invoked while running 2.4.6, even though the heap size at that time was 6GB...


(Christian Dahlqvist) #15

I just noticed that you are running Packetbeat on the node as well. What happens if you move Packetbeat to a different host?


#16

How much memory can packetbeat consume?


(Christian Dahlqvist) #17

I don’t know, but suspect it depends on traffic volume.
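One way to find out is to sample its resident set on the affected host over time. A Linux-only sketch that reads RSS from /proc (demonstrated here on the current process; in practice you'd pass Packetbeat's pid, e.g. from `pidof packetbeat`):

```python
import os

def rss_kb(pid):
    """Resident set size in kB for a pid, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return None

print(rss_kb(os.getpid()), "kB")
```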


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.