We're attempting to create a new Elasticsearch cluster for indexing URLs, but have run into what looks like a memory leak when turning replication on for our indices.
The current setup is 5 x m2.2xlarge instances, each with 4 TB of EBS storage mounted (not Provisioned IOPS).
We create one index per day and will keep the past 90 days around for searching. We have been performing bulk inserts with routing enabled, one day at a time, and successfully loaded all 90 days, which came to approximately 313 million documents. I inserted with number_of_replicas set to 0 on each index to increase our bulk insertion rate.
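For reference, each daily index is created before its backfill with replicas disabled, roughly like this (mapping omitted, and localhost standing in for one of our nodes):

curl -XPUT 'http://localhost:9200/domain_url_2014-01-01' -d '{
  "settings" : {
    "index" : {
      "number_of_replicas" : 0
    }
  }
}'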
I then started changing the number of replicas per index to 1, one index at a time. I was able to successfully create the replicas for about 70 of the daily indices (i.e. about 65-70 days' worth of data), but then ran out of heap space.
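To add the replicas, I updated each index in turn with something along these lines:

curl -XPUT 'http://localhost:9200/domain_url_2014-01-01/_settings' -d '{
  "index" : { "number_of_replicas" : 1 }
}'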
We are planning to bulk insert about 2-4 million records per day, in 10-minute intervals, so I would appreciate any advice on the validity of our configuration so far. In particular, we would like to know whether there are any known memory leaks involving shard replication or bulk inserts.
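For completeness, the bulk requests look roughly like this; the type, document IDs, fields, and routing values below are simplified placeholders for what we actually send:

curl -XPOST 'http://localhost:9200/domain_url_2014-01-01/_bulk' -d '
{ "index" : { "_type" : "url", "_id" : "1", "_routing" : "example.com" } }
{ "url" : "http://example.com/page1", "domain" : "example.com" }
{ "index" : { "_type" : "url", "_id" : "2", "_routing" : "example.com" } }
{ "url" : "http://example.com/page2", "domain" : "example.com" }
'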
Our configuration:
Ubuntu 12.04 LTS
Java 7 u51 (I am aware of https://groups.google.com/forum/#!msg/elasticsearch/D4WNQZSvqSU/zo7ancelKi4J and am doing a rolling restart of the cluster as we speak to move to Java 7 u25).
Marvel was installed on each node, but in order to simplify our setup, I will be removing it during the aforementioned cluster restart.
Elasticsearch 1.0.0
"version" : {
"number" : "1.0.0",
"build_hash" : "a46900e9c72c0a623d71b54016357d5f94c8ea32",
"build_timestamp" : "2014-02-12T16:18:34Z",
"build_snapshot" : false,
"lucene_version" : "4.6"
},
Settings applied to each index for our bulk inserts (refresh_interval will be set back to "1s" once the backfill and replica creation are done):
{
  "index" : {
    "merge.policy.max_merge_at_once" : 4,
    "merge.policy.segments_per_tier" : 20,
    "refresh_interval" : "-1"
  }
}
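Once the backfill and replica creation finish, the plan is to turn refresh back on across the daily indices with something like:

curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index" : { "refresh_interval" : "1s" }
}'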
Transient cluster-wide settings:
{
  "transient" : {
    "index.merge.policy.merge_factor" : 30,
    "threadpool.bulk.queue_size" : -1,
    "index.merge.scheduler.max_thread_count" : 5
  }
}
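These were applied through the cluster update-settings API, along these lines (showing just one of the settings):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "threadpool.bulk.queue_size" : -1
  }
}'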
Our Java configuration variables (those that are different from the default /etc/default/elasticsearch in the .deb):
JAVA_HOME=/usr/lib/jvm/java-1.7.0_25-oracle (this previously pointed at Oracle's Java 7 u51; we are backing it down to u25 during the restart)
ES_HEAP_SIZE=18g
MAX_OPEN_FILES=256000
From a running instance:
/usr/lib/jvm/java-1.7.0_25-oracle/bin/java -Xms18g -Xmx18g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-1.0.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* -Des.default.config=/etc/elasticsearch/elasticsearch.yml -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch
The log messages I saw around the OutOfMemoryError:
[2014-04-09 14:17:28,393][WARN ][cluster.action.shard ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
[2014-04-09 14:24:47,111][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][4] received shard failed for [domain_url_2014-01-03][4], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 14:26:06,104][WARN ][cluster.action.shard ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
[2014-04-09 14:26:48,562][INFO ][cluster.metadata ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
[2014-04-09 14:27:27,235][INFO ][cluster.metadata ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
[2014-04-09 14:37:01,359][INFO ][cluster.metadata ] [esearch16] [.marvel-2014.04.09] update_mapping [shard_event] (dynamic)
[2014-04-09 14:37:01,531][INFO ][cluster.metadata ] [esearch16] [.marvel-2014.04.09] update_mapping [routing_event] (dynamic)
[2014-04-09 14:40:51,469][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 14:41:00,353][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-03-11][2] received shard failed for [domain_url_2014-03-11][2], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [HuQzTDCmTMeS3He3DumnOg], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:04:32,504][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][2] received shard failed for [domain_url_2014-01-03][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:12:13,529][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:39:24,021][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][1] received shard failed for [domain_url_2014-01-03][1], node[4ft2nd1lRE-BdvL2iYGIkg], relocating [BKCZOztRRP6FXVKJSkT_oA], [R], s[INITIALIZING], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [Failed to start shard, message [RecoveryFailedException[[domain_url_2014-01-03][1]: Recovery failed from [esearch15][EkR2xgpURrunkxrRnpkzYQ][esearch15.tlys.us][inet[ip-10-185-171-146.ec2.internal/10.185.171.146:9300]] into [esearch14][4ft2nd1lRE-BdvL2iYGIkg][esearch14.tlys.us][inet[ip-10-184-39-23.ec2.internal/10.184.39.23:9300]]]; nested: RemoteTransportException[[esearch15][inet[/10.185.171.146:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[domain_url_2014-01-03][1] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch14][inet[/10.184.39.23:9300]][index/shard/recovery/prepareTranslog]]; nested: OutOfMemoryError[Java heap space]; ]]
[2014-04-09 15:42:51,176][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-06][1] received shard failed for [domain_url_2014-01-06][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [51jdwEMrTGKtTpA90ZjXiQ], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:54:42,711][DEBUG][action.admin.cluster.stats] [esearch16] failed to execute on node [4ft2nd1lRE-BdvL2iYGIkg]
org.elasticsearch.transport.RemoteTransportException: [esearch14][inet[/10.184.39.23:9300]][cluster/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException: [domain_url_2014-01-01][1] CurrentState[CLOSED]
at org.elasticsearch.index.engine.internal.InternalEngine.ensureOpen(InternalEngine.java:913)
at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1130)
at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:130)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:281)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:272)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.fst.BytesStore.<init>(BytesStore.java:62)
at org.apache.lucene.util.fst.FST.<init>(FST.java:366)
at org.apache.lucene.util.fst.FST.<init>(FST.java:301)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:481)
at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:175)
at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilterPostingsFormat.java:131)
at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat.fieldsProducer(BloomFilterPostingsFormat.java:102)
at org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat.fieldsProducer(Elasticsearch090PostingsFormat.java:79)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:115)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:235)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:382)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
at org.apache.lucene.search.XSearcherManager.<init>(XSearcherManager.java:94)
at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1462)
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:801)
at org.elasticsearch.index.engine.internal.InternalEngine.updateIndexingBufferSize(InternalEngine.java:223)
at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:201)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:437)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
... 3 more
[2014-04-09 15:54:51,827][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][1] received shard failed for [domain_url_2014-01-01][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]