We've been running into a similar issue recently -- and I don't think it has to do with indexing data as @Srinath_C's thread link suggests.
In our case we're running 4 x c3.8xlarge instances with 10TB of SSD-backed provisioned-IOPS volumes attached to each host (8000 IOPS per host). We index upwards of ~20k events per second at peak during the day, and our indexed documents vary quite a bit in field types, counts, etc. For example, we have 331 indexed fields in our current day of events.
(We store 30 days of data, at ~300GB/day for a total of ~9-12TB of used storage and ~9 billion documents)
Our hosts have 60+GB of memory, and we allocate 29.5GB of that to the heap at startup, leaving ~30GB for Lucene (via the OS filesystem cache) and other local processes.
    /usr/lib/jvm/java-7-oracle/bin/java -Xms29696m -Xmx29696m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC
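(For reference, something like the following lets us confirm the heap size and mlockall actually took effect on each node; localhost:9200 is just a placeholder for one of our hosts.)

    # Sanity-check JVM and process settings on a node:
    # jvm.mem.heap_max_in_bytes should report roughly 29GB, and
    # process.mlockall should be true if bootstrap.mlockall took effect.
    curl -s 'http://localhost:9200/_nodes/process,jvm?pretty'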
Our elasticsearch.yml file:
    ---
    bootstrap:
      mlockall: true
    cloud:
      aws:
        access_key: XXX
        region: us-east-1
        secret_key: XXX
      node:
        auto_attributes: true
    cluster:
      name: flume-elasticsearch-production_vpc-useast1
      routing:
        allocation:
          allow_rebalance: indices_all_active
          cluster_concurrent_rebalance: 2
          node_concurrent_recoveries: 8
          same_shard:
            host: true
    discovery:
      ec2:
        any_group: false
        groups:
          - default
          - OPS3-FLUME-ES
        host_type: private_ip
      type: ec2
      zen:
        ping:
          multicast:
            enabled: false
          unicast:
            enabled: false
    gateway:
      recover_after_time: 5m
    index:
      number_of_replicas: 1
      number_of_shards: 10
    indices:
      breaker:
        fielddata:
          limit: 85%
      fielddata:
        cache:
          size: 25%
      recovery:
        max_bytes_per_sec: -1
      store:
        throttle:
          max_bytes_per_sec: -1
      translog:
        flush_threshold_size: 1024mb
    node:
      name: prod-flume-es-useast1-109-i-XXX-flume-elasticsearch-production_vpc-useast1
    path:
      data: /mnt/elasticsearch/flume-elasticsearch-production_vpc-useast1
In our case, we've had the cluster crash now at least 5 times. Each time it crashes we've observed that one or more of our nodes go into a JVM garbage collection loop. When that happens, the cluster management code loses connectivity to the various nodes, new document indexing starts to fail, and we effectively fall over.
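For anyone hitting the same thing, something like this is what we've been using to watch for the GC loop from the outside (host is a placeholder):

    # Per-node JVM stats; a node stuck in a GC loop shows the old collector's
    # collection_count and collection_time_in_millis climbing rapidly while
    # jvm.mem.heap_used_percent stays pinned near the top of the heap.
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'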
We think we've finally narrowed this down to the indices.fielddata.cache.size setting. We originally had it at 75%. We tried dropping it down to 50% and still had a failure yesterday. We're now running at 25%, as you can see in the config above, and we're hoping everything holds.
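Something like the cat fielddata API should also show how much field data each node is actually holding and which fields are the worst offenders (host is a placeholder):

    # Per-node, per-field fielddata sizes; ?v adds column headers.
    curl -s 'http://localhost:9200/_cat/fielddata?v'

    # Per-node fielddata totals plus eviction counts.
    curl -s 'http://localhost:9200/_nodes/stats/indices/fielddata?pretty'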
Our theory at this point is that we have no problem indexing documents, but because our cluster runs combined data and master nodes, all hell breaks loose when the JVM GC gets involved. We think users pull up a dashboard in Kibana (we make heavy use of structured logs), the search loads a ton of field data onto the heap (which is allowed by the large indices.fielddata.cache.size setting), and we ultimately force the instance into a garbage collection loop.
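For what it's worth, the fielddata cache can also be dropped manually when a node's heap starts climbing; a rough sketch (host is a placeholder):

    # Drop loaded fielddata across all indices (scope it by putting an index
    # name before /_cache/clear). This frees heap immediately, but the next
    # Kibana dashboard load will rebuild whatever field data it needs.
    curl -s -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'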
The questions we haven't answered yet are:

- What else is using massive amounts of memory out of the heap? If we have 29GB available and we allocate 14.5GB to field data, what is using the other 14.5GB? (See the node-stats sketch after this list.)
- We've seen full cluster failures (effectively, access to the cluster API hangs, indexing freezes, etc.) even when only a single node in the cluster begins the GC loop. Why? If we move to dedicated master nodes (planned for next week) and a data node starts into its GC loop, will the cluster continue to operate properly?
- Why is the indices.fielddata.cache.expire setting experimental and marked for deprecation? It seems like it would be wildly helpful to expire stale entries out of this cache whenever appropriate. Am I missing something here?
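To try to answer the first question ourselves, this is roughly the heap breakdown we've been looking at (host is a placeholder). Fielddata, the filter cache, the id cache, and Lucene segment memory all live on the heap alongside indexing buffers and in-flight bulk/search requests:

    # Everything under "indices" here that reports a memory size is heap-resident:
    # fielddata, filter_cache, id_cache, and segments (Lucene segment memory).
    curl -s 'http://localhost:9200/_nodes/stats/indices?pretty'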