OutOfMemoryError[Java heap space]

Friendly greetings!
I have a 2-node cluster on EC2. (I also had a 5-node cluster with the same
kind of problem.)

Every week (sometimes once every two weeks, sometimes twice a week) a node
crashes with this kind of error:
[2013-10-22 12:42:08,401][DEBUG][action.search.type ] [Aegis] [kibana-int][4], node[KkhYhIvsQN6QmiAJEJ9HzA], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@764568a2]
org.elasticsearch.search.SearchParseException: [kibana-int][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [
{"facets":{
  "1":{"date_histogram":{"field":"@timestamp","interval":"10m"},"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"startVideo"}},"filter":{"bool":{"must":[{"match_all":{}},{"range":{"@timestamp":{"from":1382272678894,"to":1382445478894}}},{"bool":{"must":[{"match_all":{}}]}}]}}}}}}},
  "2":{"date_histogram":{"field":"@timestamp","interval":"10m"},"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"quartile50"}},"filter":{"bool":{"must":[{"match_all":{}},{"range":{"@timestamp":{"from":1382272678894,"to":1382445478894}}},{"bool":{"must":[{"match_all":{}}]}}]}}}}}}},
  "3":{"date_histogram":{"field":"@timestamp","interval":"10m"},"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"playCompleted"}},"filter":{"bool":{"must":[{"match_all":{}},{"range":{"@timestamp":{"from":1382272678894,"to":1382445478894}}},{"bool":{"must":[{"match_all":{}}]}}]}}}}}}},
  "0":{"date_histogram":{"field":"@timestamp","interval":"10m"},"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"advertiserBillable"}},"filter":{"bool":{"must":[{"match_all":{}},{"range":{"@timestamp":{"from":1382272678894,"to":1382445478894}}},{"bool":{"must":[{"match_all":{}}]}}]}}}}}}},
  "4":{"date_histogram":{"field":"@timestamp","interval":"10m"},"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"seekStart"}},"filter":{"bool":{"must":[{"match_all":{}},{"range":{"@timestamp":{"from":1382272678894,"to":1382445478894}}},{"bool":{"must":[{"match_all":{}}]}}]}}}}}}}
},"size":0}]]
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:561)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:464)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:449)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:442)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:214)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:293)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:224)
    at org.elasticsearch.search.action.SearchServiceTransportAction$4.handleException(SearchServiceTransportAction.java:222)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:180)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:170)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:122)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [1]: (key) field [@timestamp] not found
    at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
    at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:92)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:549)
    ... 35 more

It is then followed by an out-of-memory error:

[2013-10-24 01:25:01,348][WARN ][index.merge.scheduler ] [Aegis] [logstash-2013.10.23][1] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2013-10-24 01:25:01,349][WARN ][index.engine.robin ] [Aegis] [logstash-2013.10.23][1] failed engine
org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:99)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2013-10-24 01:25:01,350][WARN ][cluster.action.shard ] [Aegis] sending failed shard for [logstash-2013.10.23][1], node[KkhYhIvsQN6QmiAJEJ9HzA], [P], s[INITIALIZING], reason [engine failure, message [MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2013-10-24 01:25:01,351][WARN ][cluster.action.shard ] [Aegis] received shard failed for [logstash-2013.10.23][1], node[KkhYhIvsQN6QmiAJEJ9HzA], [P], s[INITIALIZING], reason [engine failure, message [MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2013-10-24 04:17:34,585][WARN ][http.netty ] [Aegis] Caught exception while handling client http traffic, closing connection [id: 0x1650ccf8, /10.164.36.aaa:27357 => /10.80.141.bbb:9200]
java.lang.OutOfMemoryError: Java heap space

(I removed the last part of the IPs.)
The node never recovers; I have to kill -9 the java process.

I have this config:

aws:
  access_key: ***
  secret_key: ***

node:
  auto_attributes: true

discovery:
  type: ec2

discovery.ec2.groups: elasticsearch
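
(In case it matters: the zen settings still apply on top of the ec2 discovery type, and I've seen it suggested to give discovery pings more slack on EC2 networking. A sketch of what I might add to elasticsearch.yml; the value is my guess and untested:)

# allow more time for discovery pings on EC2 (hypothetical value, not verified)
discovery.zen.ping.timeout: 10s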

I had the S3 gateway configured too, but it never worked.

Here is the command line shown by ps:
/usr/lib/jvm/jre/bin/java -Xms4g -Xmx4g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.path.home=/home/elasticsearch/elasticsearch -cp :/home/elasticsearch/elasticsearch/lib/elasticsearch-0.90.5.jar:/home/elasticsearch/elasticsearch/lib/:/home/elasticsearch/elasticsearch/lib/sigar/ org.elasticsearch.bootstrap.ElasticSearch

I edited elasticsearch.in.sh to add:
ES_HEAP_SIZE=4g

The node has 8GB of memory.
Since I switched to 2 nodes and configured the heap size correctly, the only
node that crashes is the one hosting Kibana 3.
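
(Side note: rather than patching elasticsearch.in.sh directly, I believe the script also picks ES_HEAP_SIZE up from the environment, so a wrapper or init script could just export it; a sketch, untested on my side:)

# export the heap size before startup; elasticsearch.in.sh derives -Xms/-Xmx from it (as far as I can tell)
export ES_HEAP_SIZE=4g
/home/elasticsearch/elasticsearch/bin/elasticsearch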

ElasticHQ shows these statistics:
2 Nodes
230 Total Shards
230 Successful Shards
23 Indices
427,207,138 Total Documents
425.1GB Total Size

I insert around 100 docs/s through a load balancer that distributes them
randomly across the 2 nodes. When one of the nodes crashes, the inserted data
is lost on both nodes until I restart (so I have a nice hole in my data...).
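
(To at least notice the hole sooner, I'm thinking of polling cluster health next to the load balancer and pausing the feeder when the cluster is not green; a rough sketch using the _cluster/health API, hostname and interval are mine:)

# log the cluster status every 10s; the feeder would stop inserting on "red" (sketch only)
while true; do
  curl -s 'http://localhost:9200/_cluster/health' | grep -o '"status":"[a-z]*"'
  sleep 10
done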

Any ideas?
Thank you.

--
Laurent Laborde

Version: 0.90.5. I had the same kind of problem with 0.90.3.

--
Laurent Laborde

OpenJDK Runtime Environment (IcedTea6 1.11.11.90)
(amazon-62.1.11.11.90.55.amzn1-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Could that be a problem? Should I use the Oracle JDK?

--
Laurent Laborde

I ran Eclipse Memory Analyzer on the hprof dump:

Problem Suspect 1

One instance of "org.elasticsearch.common.cache.LocalCache" loaded by
"sun.misc.Launcher$AppClassLoader @ 0x6f8394d78" occupies 1,944,174,448
(45.35%) bytes. The memory is accumulated in one instance of
"org.elasticsearch.common.cache.LocalCache$Segment[]" loaded by
"sun.misc.Launcher$AppClassLoader @ 0x6f8394d78".

Keywords
org.elasticsearch.common.cache.LocalCache
org.elasticsearch.common.cache.LocalCache$Segment[]
sun.misc.Launcher$AppClassLoader @ 0x6f8394d78

Problem Suspect 2

2,505 instances of "org.apache.lucene.index.SegmentCoreReaders", loaded by
"sun.misc.Launcher$AppClassLoader @ 0x6f8394d78", occupy 993,789,272
(23.18%) bytes.

Keywords
sun.misc.Launcher$AppClassLoader @ 0x6f8394d78
org.apache.lucene.index.SegmentCoreReaders

Problem Suspect 3

4 instances of "org.elasticsearch.common.cache.LocalCache$Segment", loaded
by "sun.misc.Launcher$AppClassLoader @ 0x6f8394d78", occupy 614,958,128
(14.35%) bytes.

Biggest instances:

  • org.elasticsearch.common.cache.LocalCache$Segment @ 0x6f8602b70 - 174,528,264 (4.07%) bytes
  • org.elasticsearch.common.cache.LocalCache$Segment @ 0x6f8602850 - 162,665,048 (3.79%) bytes
  • org.elasticsearch.common.cache.LocalCache$Segment @ 0x6f8602cb0 - 146,308,472 (3.41%) bytes
  • org.elasticsearch.common.cache.LocalCache$Segment @ 0x6f8602e40 - 131,456,344 (3.07%) bytes

Keywords
org.elasticsearch.common.cache.LocalCache$Segment
sun.misc.Launcher$AppClassLoader @ 0x6f8394d78
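
(If that LocalCache is the field data cache, it would explain a lot: the Kibana date_histogram facets have to load @timestamp field data for every document in range, and as far as I can tell that cache is unbounded by default. Something I might try in elasticsearch.yml, not yet tested on 0.90.5:)

# cap field data instead of letting it grow until the heap is gone (my reading of the fielddata docs)
indices.fielddata.cache.size: 30%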

OK...
I ran some requests in Kibana, including big ones.
Everything was fine, but the "% Heap Used" shown in ElasticHQ kept growing slowly.

Then, suddenly, everything froze.
Right now the request is frozen and all the plugins are unresponsive.

No log, nothing...

As I'm writing this mail, I just got a log entry:
[2013-10-24 12:44:51,568][INFO ][discovery.ec2 ] [Anvil] master_left [[LeBeau, Remy][I0U0qBYxT5G5TlaT5q6Jng][inet[/10.100.231.aaa:9300]]{aws_availability_zone=us-east-1a}], reason [do not exists on master, act as master failure]
[2013-10-24 12:44:51,778][INFO ][cluster.service ] [Anvil] master {new [Anvil][14_v_SU9RQW8QxMmO90LMA][inet[/10.80.141.bbb:9300]]{aws_availability_zone=us-east-1a}, previous [LeBeau, Remy][I0U0qBYxT5G5TlaT5q6Jng][inet[/10.100.231.aaa:9300]]{aws_availability_zone=us-east-1a}}, removed {[LeBeau, Remy][I0U0qBYxT5G5TlaT5q6Jng][inet[/10.100.231.aaa:9300]]{aws_availability_zone=us-east-1a},}, reason: zen-disco-master_failed ([LeBeau, Remy][I0U0qBYxT5G5TlaT5q6Jng][inet[/10.100.231.aaa:9300]]{aws_availability_zone=us-east-1a})

And on the master:
[2013-10-24 12:42:56,026][INFO ][cluster.service ] [LeBeau, Remy] removed {[Anvil][14_v_SU9RQW8QxMmO90LMA][inet[/10.80.141.bb:9300]]{aws_availability_zone=us-east-1a},}, reason: zen-disco-node_failed([Anvil][14_v_SU9RQW8QxMmO90LMA][inet[/10.80.141.bb:9300]]{aws_availability_zone=us-east-1a}), reason failed to ping, tried [3] times, each with maximum [30s] timeout

Anvil is the node hosting Kibana.
LeBeau is the master.

Right now ElasticHQ is back; it shows only one node (Anvil) with 99.99%
heap used.

So: they just split...
The master (LeBeau) removed Anvil because of the ping timeout (see the log above).
The other node (Anvil) detected that the master had left and elected itself
master (so we now have a nice split brain: two masters, zero slaves...).

Just because I was playing with Kibana.
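
(The usual guard against this seems to be discovery.zen.minimum_master_nodes, so that a lone node can never elect itself master. With only 2 nodes the quorum is 2, which means no master at all while one node is down, but at least no split brain. In elasticsearch.yml, untested on my setup:)

# quorum of master-eligible nodes; with 2 nodes, an isolated node stays non-master
discovery.zen.minimum_master_nodes: 2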

Three minutes after the split, an out-of-memory error:

org.elasticsearch.transport.SendRequestTransportException: [LeBeau, Remy][inet[/10.100.231.bbb:9300]][search/phase/query]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:204)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:173)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:208)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:293)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$3.onFailure(TransportSearchTypeAction.java:224)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:205)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:203)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:186)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [LeBeau, Remy][inet[/10.100.231.bbb:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:825)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:518)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
    ... 14 more
[2013-10-24 12:49:38,697][DEBUG][action.bulk ] [Anvil] [logstash-2013.10.24][4] failed to execute bulk item (index) index {[logstash-2013.10.24][collect][nwqWhn15QBmxo-maVmDBog], source[{"eventName":"quartile75","fingerprint":"1690954033","sessionId":"92f819e1ba13e1561ddc02032601bc4a","sequenceId":"6","clientTimestamp":"1382618897621","campaignId":"27251","campaignName":"Scratch To Win - McDonald's","buzzId":"1237728","channelId":"86132","channelName":"Appcash","advertiserId":"81527","pageDomain":"www.appcash.fr","pagePath":"/ebuzzing/mcdo.php","pageProtocol":"http","version":"3","geoCountry":"FR","geoRegion":"A3","geoCity":"Amilly","geoLatitude":"47.972801208496094","geoLongitude":"2.771899938583374","category":"Jeune","duration":"30","mediaProvider":"youtube","playerTarget":"html","currentTime":"22","buzzplayerVersion":"76fc35d","platformVersion":"e5ca6fa","trackerVersion":"c2ec4ca","playerSize":"420x257","youtubeId":"oj7C53EVrZQ","S_v":"0.3.0","userId":"69b10295-8710-48a4-abdb-c1a7f5493d2c","user-agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11A466","ip":"77.196.247.170","@timestamp":"2013-10-24T12:48:18+00:00"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:533)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:452)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:320)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:401)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:533)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:418)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.compress.BufferRecycler.allocOutputBuffer(BufferRecycler.java:77)
    at org.elasticsearch.common.compress.lzf.LZFCompressedStreamOutput.<init>(LZFCompressedStreamOutput.java:40)
    at org.elasticsearch.common.compress.lzf.LZFCompressor.streamOutput(LZFCompressor.java:140)
    at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:296)
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:385)
    at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:234)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:504)
    ... 9 more

...

--
Laurent Laborde

Why is there this error on Anvil?
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [LeBeau, Remy][inet[/10.100.231.bbb:9300]] Node not connected

Anvil ejected LeBeau and became the new master three minutes before this...

--
Laurent Laborde

The answers I got on IRC:

  1. Use the Oracle JVM 7.
  2. Use at least an xlarge VM (16GB RAM).
  3. Disable _all (see the sketch below).
  4. Optimize the indices (already doing it).
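
(For point 3, I believe _all can be disabled for all future logstash indices with an index template; a sketch, the template name is mine and this is untested:)

curl -XPUT 'http://localhost:9200/_template/logstash_disable_all' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}'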

--
Laurent Laborde
