Client Nodes being oom-killed


#1

Hi,
I'm running Elasticsearch 5.6.3 and am seeing my client nodes oom-killed periodically. This is odd because the heap size is set to 4GB while the server itself has 8GB of RAM, and nothing else is running on the node aside from Packetbeat.

Can someone help me get more info on these nodes' on- and off-heap memory usage so I can understand why oom-kill is kicking in?

Regards,
D


(Mark Walkom) #2

If you are not running Monitoring (from X-Pack) then you should install that to see what is happening.


#3

Is that available in the free version?


(Mark Walkom) #4

Yep, it's free!


#5

There's likely to be a lead time until I could implement that. Is there anything I can do in the meantime?


(Christian Dahlqvist) #6

If you can provide us with the output of the cluster stats API we would have a better view of the cluster and what is using heap.
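That's a single request, e.g. from the Kibana Console (or the equivalent curl against any node):

```
GET _cluster/stats?human&pretty
```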


#7

Cluster stats as requested:

{
  "_nodes" : {
    "total" : 15,
    "successful" : 15,
    "failed" : 0
  },
  "cluster_name" : "elk_prd",
  "timestamp" : 1527165795887,
  "status" : "green",
  "indices" : {
    "count" : 39,
    "shards" : {
      "total" : 742,
      "primaries" : 371,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 20,
          "avg" : 19.025641025641026
        },
        "primaries" : {
          "min" : 1,
          "max" : 10,
          "avg" : 9.512820512820513
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 5868458697,
      "deleted" : 7
    },
    "store" : {
      "size_in_bytes" : 6724892904181,
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 1015712,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 16851845878,
      "total_count" : 785549301,
      "hit_count" : 46445247,
      "miss_count" : 739104054,
      "cache_size" : 110755,
      "cache_count" : 511818,
      "evictions" : 401063
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 16516,
      "memory_in_bytes" : 11085201082,
      "terms_memory_in_bytes" : 9108751709,
      "stored_fields_memory_in_bytes" : 1370464368,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 725696,
      "points_memory_in_bytes" : 196900069,
      "doc_values_memory_in_bytes" : 408359240,
      "index_writer_memory_in_bytes" : 718227640,
      "version_map_memory_in_bytes" : 22902486,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 9223372036854775807,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 15,
      "data" : 10,
      "coordinating_only" : 1,
      "master" : 3,
      "ingest" : 2
    },
    "versions" : [
      "5.6.3"
    ],
    "os" : {
      "available_processors" : 47,
      "allocated_processors" : 47,
      "names" : [
        {
          "name" : "Linux",
          "count" : 15
        }
      ],
      "mem" : {
        "total_in_bytes" : 366583300096,
        "free_in_bytes" : 12803936256,
        "used_in_bytes" : 353779363840,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 395
      },
      "open_file_descriptors" : {
        "min" : 462,
        "max" : 720,
        "avg" : 633
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 3275809847,
      "versions" : [
        {
          "version" : "1.8.0_45",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.45-b02",
          "vm_vendor" : "Oracle Corporation",
          "count" : 15
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 97341295728,
        "heap_max_in_bytes" : 186421411840
      },
      "threads" : 907
    },
    "fs" : {
      "total_in_bytes" : 14962766397440,
      "free_in_bytes" : 8197970984960,
      "available_in_bytes" : 8040249016320,
      "spins" : "true"
    },
    "plugins" : [
      {
        "name" : "discovery-ec2",
        "version" : "5.6.3",
        "description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname" : "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "repository-s3",
        "version" : "5.6.3",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "has_native_controller" : false
      },
      {
        "name" : "mapper-size",
        "version" : "5.6.3",
        "description" : "The Mapper Size plugin allows document to record their uncompressed size at index time.",
        "classname" : "org.elasticsearch.plugin.mapper.MapperSizePlugin",
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 15
      },
      "http_types" : {
        "netty4" : 15
      }
    }
  }
}
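For what it's worth, a quick back-of-the-envelope check on the JVM figures above suggests the cluster-wide heap is only about half used, which is why I suspect the pressure is off-heap:

```python
# Figures copied from nodes.jvm.mem in the cluster stats output above.
heap_used = 97_341_295_728    # heap_used_in_bytes
heap_max = 186_421_411_840    # heap_max_in_bytes

used_pct = 100 * heap_used / heap_max
print(f"cluster-wide heap usage: {used_pct:.1f}%")  # ~52.2%
```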

Are there any logging settings I can enable on the client nodes to highlight behaviours/issues in advance of oom-kill dropping the axe?

Regards,
D


#8

@Christian_Dahlqvist see above :slight_smile:


(Christian Dahlqvist) #9

Do you have any non-standard settings on your nodes, e.g. cache sizes? What types of queries are you running?


#10

@Christian_Dahlqvist I have thread_pool.search.queue_size set to 1500 because the default search queue size was exceeded. The client nodes themselves are configured in quite a vanilla manner.
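For completeness, that's roughly the only non-default line in elasticsearch.yml on those nodes:

```yaml
# elasticsearch.yml on the client nodes
# (the 5.x default for the search queue is 1000)
thread_pool.search.queue_size: 1500
```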
I'm not intimately familiar with most of the queries hitting the cluster; it's predominantly Kibana traffic, I'd expect. However, we do have some scripted bulk searches going on, one of which has been reported as running very slowly in the last week or so (oom-kill has been happening since upgrading to 5.6.3 in April). An example of that query looks like this:

{'sort': [{'@timestamp': {'unmapped_type': 'date', 'order': 'asc'}}],
 'query': {'bool': {'filter': {'bool': defaultdict(<type 'list'>,
                        {'must': [{'range': {'@timestamp': {
                            'gte': datetime.datetime(2018, 5, 16, 0, 6, 32, 954000),
                            'lte': datetime.datetime(2018, 5, 16, 0, 7, 25, 652000)}}}]})},
                    'must': {'query_string': {'query': 'message: DiagnosticsEvent OR message: StatusEvent'}}}},
 'size': 10000}

This query gets run multiple times over different time windows to collect the log lines required.
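The windowing logic in the script is roughly this (simplified sketch; the window size and the times here are placeholders, and the actual search call is omitted):

```python
from datetime import datetime, timedelta

def time_windows(start, end, step):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    cur = start
    while cur < end:
        yield cur, min(cur + step, end)
        cur += step

# e.g. one hour of logs split into 5-minute query windows,
# each window queried with a body like the one shown above
windows = list(time_windows(datetime(2018, 5, 16, 0, 0),
                            datetime(2018, 5, 16, 1, 0),
                            timedelta(minutes=5)))
print(len(windows))  # 12
```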

Regards,
D


(Christian Dahlqvist) #11

The increased queue size will result in more heap being used. It is also hard for me to know how your queries affect heap usage. It looks like some of the aggregations have a large size parameter, that could result in a lot of buckets, driving heap usage.

It is possible that you may need to increase the amount of heap on these nodes. As dedicated coordinating-only nodes do not benefit from the file system page cache, you can increase heap to 75% of available RAM (maybe even a bit higher) on them. The same applies to any dedicated master nodes, although that does not seem to be the problem here.
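On an 8GB coordinating-only node that would mean something like this in jvm.options (6g is ~75% of RAM; adjust to taste):

```
# jvm.options on the coordinating-only node
-Xms6g
-Xmx6g
```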


#12

@Christian_Dahlqvist These nodes are not running out of heap; they're being oom-killed by the Linux kernel.
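For the record, this is how I'm reading the evidence out of the kernel log (dmesg / kern.log); the sample line below is made up but follows the usual oom-killer format:

```python
import re

# Typical oom-killer record from `dmesg` or /var/log/kern.log
# (sample values, made up for illustration):
rss_line = ("Killed process 12345 (java) "
            "total-vm:9181204kB, anon-rss:7512340kB, file-rss:0kB")

m = re.search(r"Killed process (\d+) \((\S+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB",
              rss_line)
pid, name = m.group(1), m.group(2)
vm_kb, rss_kb = int(m.group(3)), int(m.group(4))
print(f"{name} (pid {pid}): ~{rss_kb / 1024 / 1024:.1f}GB resident at kill time")
```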


#13

@Christian_Dahlqvist Here's a better example of the query I showed you above (as reported by the data node):

{
	"size": 10000,
	"query": {
		"bool": {
			"must": [{
				"query_string": {
					"query": "message: DiagnosticsEvent OR message: StatusEvent",
					"fields": [],
					"use_dis_max": true,
					"tie_breaker": 0.0,
					"default_operator": "or",
					"auto_generate_phrase_queries": false,
					"max_determinized_states": 10000,
					"enable_position_increments": true,
					"fuzziness": "AUTO",
					"fuzzy_prefix_length": 0,
					"fuzzy_max_expansions": 50,
					"phrase_slop": 0,
					"escape": false,
					"split_on_whitespace": true,
					"boost": 1.0
				}
			}],
			"filter": [{
				"bool": {
					"must": [{
						"range": {
							"@timestamp": {
								"from": "2018-05-17T06:45:32.452000",
								"to": "2018-05-17T06:48:50.719000",
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}],
					"disable_coord": false,
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}],
			"disable_coord": false,
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"sort": [{
		"@timestamp": {
			"order": "asc",
			"unmapped_type": "date"
		}
	}]
}

I just tried running this from the kibana console and got Error: RangeError: Maximum call stack size exceeded

EDIT: That seems like a browser-related issue. From Chrome that returns in 750ms!


#14

@Christian_Dahlqvist Any update? I'd appreciate help with troubleshooting/understanding why the kernel is oom-killing the java process. It's worth noting, by the way, that oom-kill was never invoked while running 2.4.6, even though the heap size at that time was 6GB...


(Christian Dahlqvist) #15

I just noticed that you are running Packetbeat on the node as well. What happens if you move Packetbeat to a different host?


#16

How much memory can packetbeat consume?


(Christian Dahlqvist) #17

I don’t know, but suspect it depends on traffic volume.
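One way to find out is to sample its resident set on the affected host over time. A Linux-only sketch that reads RSS from /proc (demonstrated here on the current process; in practice you'd pass Packetbeat's pid, e.g. from `pidof packetbeat`):

```python
import os

def rss_kb(pid):
    """Resident set size in kB for a pid, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return None

print(rss_kb(os.getpid()), "kB")
```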


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.