We have a reasonably large ES cluster and we're struggling to understand our Java heap usage. Over the last few weeks, nearly every day during our heaviest bulk indexing period, the JVM falls into garbage collection loops as soon as the cluster starts receiving Kibana searches.
At a high level, our cluster looks like this:
- 3 x Master Nodes: c3.large instances
- 8 x Data Nodes: 3.8xlarge w/ 5TB Provisioned IOPS EBS (4000 IOPS per node)
- 8 x Client Nodes: Flume Indexing nodes that dump data to the local ES client
- 2 x Client Nodes: Kibana Log Searching nodes
Data? We have that ... 15TB of storage over 30 days of data. Roughly ~300-400GB/day with about 300 million events stored per day.
Fields? We have those too. ~300-400 unique fields are added to the schema per day. I know, it's more than is ideal ... but they are extremely valuable to our engineers, so we're not going to be changing that any time soon.
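For anyone wondering how a day's field count can be measured, here's a minimal sketch (the host and index name are placeholders, not our real ones) that walks one index's `_mapping` output and counts leaf fields:

```python
#!/usr/bin/env python3
"""Sketch: count leaf fields in one day's index mapping via the ES 1.x _mapping API."""
import json
import urllib.request

ES_HOST = "http://localhost:9200"   # placeholder -- any node in the cluster
INDEX = "flume-2015.06.01"          # placeholder index name, not our real naming pattern


def count_leaf_fields(properties):
    """Recursively count concrete (leaf) fields under a 'properties' block."""
    count = 0
    for field_def in properties.values():
        if "properties" in field_def:       # object/nested field: recurse into it
            count += count_leaf_fields(field_def["properties"])
        else:                               # leaf field with a concrete type
            count += 1
        # multi-fields ("fields") add extra sub-fields on top of the parent field
        count += len(field_def.get("fields", {}))
    return count


def main():
    with urllib.request.urlopen("%s/%s/_mapping" % (ES_HOST, INDEX)) as resp:
        mapping = json.load(resp)

    total = 0
    for index_name, index_mapping in mapping.items():
        for doc_type, type_mapping in index_mapping.get("mappings", {}).items():
            n = count_leaf_fields(type_mapping.get("properties", {}))
            print("%s/%s: %d fields" % (index_name, doc_type, n))
            total += n
    print("total fields: %d" % total)


if __name__ == "__main__":
    main()
```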
Finally, we have metrics. We're not using Marvel, but we are collecting all the metrics via Collectd and graphing them.
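Those graphs are essentially the per-node JVM numbers that the `_nodes/stats` API exposes; something like this sketch (the host is a placeholder) pulls the same heap usage and old-gen GC counters directly:

```python
#!/usr/bin/env python3
"""Sketch: per-node heap usage and old-gen GC activity from /_nodes/stats (ES 1.x)."""
import json
import urllib.request

ES_HOST = "http://localhost:9200"   # placeholder -- point at any node in the cluster


def main():
    with urllib.request.urlopen("%s/_nodes/stats/jvm" % ES_HOST) as resp:
        stats = json.load(resp)

    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        heap_pct = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        print("%-60s heap %5.1f%%  old-gen GCs %6d (%.1fs total)" % (
            node["name"],
            heap_pct,
            old_gc["collection_count"],
            old_gc["collection_time_in_millis"] / 1000.0,
        ))


if __name__ == "__main__":
    main()
```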
Here's our elasticsearch.yml (this copy is from one of the master nodes):

```yaml
bootstrap:
  mlockall: true
cloud:
  aws:
    access_key: XXX
    region: us-east-1
    secret_key: XXX
  node:
    auto_attributes: true
cluster:
  name: flume-elasticsearch-production_vpc-useast1
  routing:
    allocation:
      allow_rebalance: indices_all_active
      cluster_concurrent_rebalance: 2
      node_concurrent_recoveries: 8
      same_shard:
        host: true
discovery:
  ec2:
    any_group: false
    groups:
      - default
      - OPS3-FLUME-ES
    host_type: private_ip
  type: ec2
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: false
gateway:
  recover_after_time: 5m
index:
  number_of_replicas: 1
  number_of_shards: 10
indices:
  breaker:
    fielddata:
      limit: 85%
  fielddata:
    cache:
      size: 25%
  recovery:
    max_bytes_per_sec: -1
  store:
    throttle:
      max_bytes_per_sec: -1
  translog:
    flush_threshold_size: 1024mb
node:
  data: false
  master: true
  name: prod-flume-es-master-useast1-114-i-c199bd6d-flume-elasticsearch-production_vpc-useast1
path:
  data: /mnt/elasticsearch/flume-elasticsearch-production_vpc-useast1
```
You can see from our heap graphs that we hit the JVM heap max a few times -- yet the fielddata cache size is definitely not the main culprit.
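A quick way to sanity-check that is the per-node, per-field view from `_cat/fielddata`; a tiny sketch, with the host again as a placeholder:

```python
#!/usr/bin/env python3
"""Sketch: dump per-node, per-field fielddata sizes via the _cat/fielddata API (ES 1.x)."""
import urllib.request

ES_HOST = "http://localhost:9200"   # placeholder

# fields=* asks for every field that currently has fielddata loaded
with urllib.request.urlopen("%s/_cat/fielddata?v&fields=*" % ES_HOST) as resp:
    print(resp.read().decode("utf-8"))
```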
What I'm looking for here is help figuring out where the rest of the heap is being used. What metric should we be looking at to track down the missing memory?
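To make the question concrete, here's a rough sketch of the accounting I have in mind: sum up the heap consumers that `_nodes/stats` does expose (fielddata, filter cache, id cache, Lucene segment memory) and see how much heap is left unexplained per node. The host is a placeholder, and the leftover number is only approximate, since it also includes garbage that simply hasn't been collected yet.

```python
#!/usr/bin/env python3
"""Sketch: sum the heap consumers exposed by /_nodes/stats and report the remainder (ES 1.x)."""
import json
import urllib.request

ES_HOST = "http://localhost:9200"   # placeholder

GB = 1024.0 ** 3


def main():
    with urllib.request.urlopen("%s/_nodes/stats/indices,jvm" % ES_HOST) as resp:
        stats = json.load(resp)

    for node in stats["nodes"].values():
        idx = node["indices"]
        heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
        known = {
            "fielddata": idx["fielddata"]["memory_size_in_bytes"],
            "filter_cache": idx["filter_cache"]["memory_size_in_bytes"],
            "id_cache": idx["id_cache"]["memory_size_in_bytes"],
            "segments": idx["segments"]["memory_in_bytes"],
            # only present on newer 1.x releases; treat as 0 if the stat is missing
            "index_writer": idx["segments"].get("index_writer_memory_in_bytes", 0),
        }
        # whatever is left is either uncollected garbage or heap we can't account for
        unexplained = heap_used - sum(known.values())
        parts = "  ".join("%s %.1fG" % (k, v / GB) for k, v in sorted(known.items()))
        print("%-60s heap %.1fG  [%s]  unexplained %.1fG" % (
            node["name"], heap_used / GB, parts, unexplained / GB))


if __name__ == "__main__":
    main()
```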