Elasticsearch version: 2.3.5
Plugins installed: [head, kopf, cloud_aws, marvel, license]
JVM version: 1.8.0_51
OS version: CentOS 6.6
Description of the problem including expected versus actual behavior:
Steps to reproduce:
Cluster has 21 nodes. Previous version (1.7.2) worked with no issues for more than a year.
What we see is huge spike in write IOPS per node (500 writes/s before 2500 writes/s now). Around noon we have a script that reads a lot of data from this cluster and our read IOPS go high as well. We are running on AWS and the volume has only 3,000 max. With all those IOPS we hit the limit of 3,000 and at this point Elastisearch goes down.
When we look at all ES nodes and get the status of Elasticsearch service we get this: elasticsearch dead but subsys locked. When we try to start ES backup we get this:
Starting elasticsearch: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000429990000, 15408234496, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 15408234496 bytes for committing reserved memory.
An error report file with more information is saved as:
So apparently init script does not correctly report Elasticsearch status - ES is still running but it is in a bad state. So we have to kill ES manually with kill -9 (which is not fun at all on 21 servers) and only then we can start ES back online.
So with that multiple questions:
Why does new ES has so much more write IOPS? How do we prevent this?
One possibility is that auto-throttling is in place? We used to have this setting in our previous cluster:
indices.store.throttle.max_bytes_per_sec = "100mb"
So 2.3.5 apparently has auto-throttling enabled by default. Since it is impossible to delete our previous setting - does new ES just ignores it? Or how do we delete it? How can we disable auto-throttling if it causes the issue?
What is wrong with init script? We use official rpm and it seems there is a bug with init script - it does not correctly detect hung process. How can we fix that?
Thanks a lot!