Am I pushing my cluster's resource limits? (OOME)


(Chris Neal) #1

Hi everyone,

I have a 2-node development cluster in EC2 on c3.2xlarge instances. That
means 8 vCPUs, 15GB RAM, and a 1Gb network, and I have 2 500GB EBS volumes
for Elasticsearch data on each instance.

I'm running Java 1.7.0_55, and using the G1 collector. The Java args are:

/usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server
-XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError

The index has 2 shards, each with 1 replica.

I have a daily index being filled with application log data. The index, on
average, gets to be about:
486M documents
53.1GB (primary size)
106.2GB (total size)

Other than indexing, there really is nothing going on in the cluster. No
searches or percolators, just collecting data.

The heap sits at about 7GB used.

The issue is that nodes run out of memory. It has run out during merges and
shard flushes:

[2014-07-07 04:38:47,900][WARN ][index.engine.internal ]
[elasticsearch-ip-10-0-0-45] [derbysoft-20140704][0] failed to flush after
setting shard to inactive
org.elasticsearch.index.engine.FlushFailedEngineException:
[derbysoft-20140704][0] Flush failed

[2014-07-07 09:16:31,492][WARN ][cluster.action.shard ]
[elasticsearch-ip-10-0-0-45] [derbysoft-20140707][0] received shard failed
for [derbysoft-20140707][0], node[PAqdsZSnSvewPWPPZbQTyw], [P], s[STARTED],
indexUUID [E1WJZUtbTQi8fs29cJbfjQ], reason [engine failure, message [merge
exception][MergeException[java.lang.OutOfMemoryError: Java heap space];
nested: OutOfMemoryError[Java heap space]; ]

I have:

  • changed the cache field type from 'resident' to 'soft'
  • changed the index refresh_interval from 1s to 60s
  • created a default template for the index such that _all is disabled,
    and all fields in the mapping are set to "not_analyzed".
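
For context, the template I'm describing looks roughly like this (the
template name and field matcher are placeholders, and the syntax is
reconstructed from memory, so treat it as a sketch rather than the exact
template in use):

```shell
# Hypothetical sketch of the template described above (ES 1.x syntax).
curl -XPUT 'localhost:9200/_template/derbysoft_logs' -d '{
  "template": "derbysoft-*",
  "settings": {
    "index.refresh_interval": "60s"
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}'
```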

Even with that, the previous day's index reports 4.4GB of field data.
The indexing rate averages about 4K docs/s, with no peaks higher than 6.75K/s.

I'm wondering if I'm just pushing the limits of my two 8GB-heap nodes with
an index of this size. Is there something else I could tweak to keep the
servers from running out of memory?
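
As a rough back-of-envelope from the numbers above (a sketch using only the
figures already quoted, nothing measured beyond them):

```shell
# Rough arithmetic from the figures quoted above:
# 4.4GB of field data per daily index vs. an 8GB heap per node,
# and 53.1GB primary size over 486M documents.
awk 'BEGIN {
  printf "fielddata share of heap: %.0f%%\n", 100 * 4.4 / 8
  printf "bytes per doc (primary): %.0f\n", 53.1 * 1024^3 / 486e6
}'
```

So a single day's field data alone ties up more than half of each node's
heap before merge buffers and indexing overhead are counted, which fits the
OOME-on-merge pattern above.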

I appreciate everyone who actually made it to the end of this long email,
but I wanted to make sure I provided all the relevant data I could think
of. :slight_smile:

Thank you very much for your time.
Chris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAND3Dph_vqTUQjag6Hey3atw4BDKhX%2B2gLi0zOYv8n2k8_jskQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

Which Java release are you running, OpenJDK or Oracle?
What's your query rate like?
Check out monitoring plugins like ElasticHQ and Marvel; they will give you
insight into cluster state and statistics.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Chris Neal) #3

Thanks for the feedback Mark.

To answer your questions:
I'm running version 1.7.0_55 of Oracle's JDK.
There is no query load at all. Data is just being loaded/indexed into the
cluster, that's it.

I've been running ElasticHQ, Marvel, and the Head plugin for a few weeks
watching things, and they are very helpful for sure. All the statistics
look fine to me. Once the heap fills up (mostly with field data), the
JVMs go into constant GC loops but cannot free anything, and eventually
OOME.
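
In case it helps anyone else watching the same pattern, this is roughly how
I've been checking field data per node (the _cat endpoint assumes ES 1.x,
and the yml setting is something I'm considering, not something I've
verified fixes this):

```shell
# Show resident field data per node (ES 1.x cat API).
curl 'localhost:9200/_cat/fielddata?v'

# The field data cache is unbounded by default; capping it forces eviction
# before the heap fills. Static setting in elasticsearch.yml (restart needed):
#   indices.fielddata.cache.size: 30%
```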

I guess my hunch is that the size of the index each day is just more than
what two 8GB heaps can handle for merges...but it's just a hunch that I was
looking for validation on :slight_smile:

I had tweaked everything I could find from reading online, and applied
appropriate default mappings on all the fields, but was still having the
problem. I was wondering if there was anything I missed as far as
configuration changes, etc. that might help with the memory pressure, or if
I had done all I could and what's left is scaling out with more RAM or more
servers.

Thanks again for all the help!
Chris



(system) #4