OutOfMemoryError Leading To IndexShardMissingException

Hi,

I am experiencing OutOfMemoryErrors on nodes in our ElasticSearch cluster
once the collective size of our indexes grows to a certain point.

We are running the following software:
ES 0.19.10
CentOS release 6.2
Oracle JVM 1.6.0_33

JVM settings:
wrapper.java.additional.3=-Xss256k
wrapper.java.additional.4=-XX:+UseParNewGC
wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
wrapper.java.additional.6=-XX:CMSInitiatingOccupancyFraction=75
wrapper.java.additional.7=-XX:+UseCMSInitiatingOccupancyOnly

Our cluster consists of 6 nodes with the following configuration:
24GB system memory
12GB JVM heap

Currently we are rolling weekly indexes with the following settings:

  • index.cache.field.expire: 10m
  • index.refresh_interval: 60s
  • index.number_of_replicas: 1
  • index.cache.field.max_size: 50000
  • index.number_of_shards: 5
  • index.routing.allocation.total_shards_per_node: 2
  • index.cache.field.type: soft

We index ~100-300 docs/sec. In one week an index grows to ~95M documents,
with an overall size of around 1.3 GB. We are currently using templates to
control index settings when new indexes are created (see the example
below). We had accumulated 3 indexes and were in the process of rolling to
the fourth when the OOMEs happened. By the time we caught the problem and
did a cluster restart, both the new index and the one we had rolled from
had shards that were corrupt and could not be allocated on the cluster.
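
For reference, the template we apply is along these lines (the template
name and index pattern here are illustrative rather than copied verbatim
from our config):

curl -XPUT 'http://localhost:9200/_template/messages_weekly' -d '{
  "template": "messages_*",
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 1,
    "index.refresh_interval": "60s",
    "index.routing.allocation.total_shards_per_node": 2,
    "index.cache.field.type": "soft",
    "index.cache.field.max_size": 50000,
    "index.cache.field.expire": "10m"
  }
}'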

My questions are as follows:

  • Are there JVM/kernel settings that could help prevent OOMEs such as this
    by perhaps being more aggressive at garbage collection?

  • Are there index or cluster settings that would help prevent corruption of
    shards in this situation?

  • Is there any way to reduce the overhead of rolling to a new index?

I would also add that we have negligible query load; our field cache sizes
are generally 200-600 MB. We are currently trying to procure more memory
for the cluster. Log entries from the crash are below.

-drew

[2012-11-11 16:23:46,627][WARN ][monitor.jvm ] [esn-05]
[gc][ParNew][263760][51242] duration [2s], collections [1]/[2.2s], total
[2s]/[27.5m], memory [11.5gb]->[11.4gb]/[11.9gb], all_pools {[Code Cache]
[9.6mb]->[9.6mb]/[48mb]}{[Par Eden Space]
[173.3mb]->[773.7kb]/[216.3mb]}{[Par Survivor Space]
[27mb]->[27mb]/[27mb]}{[CMS Old Gen] [11.3gb]->[11.4gb]/[11.7gb]}{[CMS Perm
Gen] [47.3mb]->[47.3mb]/[82mb]}
[2012-11-11 17:53:20,090][WARN ][monitor.jvm ] [esn-05]
[gc][ParNew][269123][52347] duration [1.9s], collections [1]/[2.8s], total
[1.9s]/[28.2m], memory [11.6gb]->[11.6gb]/[11.9gb], all_pools {[Code Cache]
[9.6mb]->[9.6mb]/[48mb]}{[Par Eden Space] [48.1mb]->[7.5mb]/[216.3mb]}{[Par
Survivor Space] [27mb]->[26.9mb]/[27mb]}{[CMS Old Gen]
[11.5gb]->[11.5gb]/[11.7gb]}{[CMS Perm Gen] [47.5mb]->[47.5mb]/[82mb]}
[2012-11-11 18:09:06,259][WARN ][transport.netty ] [esn-05]
exception caught on netty layer [[id: 0xecbcec60, /10.8.2.46:35446 =>
/10.8.2.50:9300]]
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.compress.BufferRecycler.allocDecodeBuffer(BufferRecycler.java:137)
at org.elasticsearch.common.compress.lzf.LZFCompressedStreamInput.<init>(LZFCompressedStreamInput.java:46)
at org.elasticsearch.common.compress.lzf.LZFCompressor.streamInput(LZFCompressor.java:128)
at org.elasticsearch.common.io.stream.CachedStreamInput.cachedHandlesCompressed(CachedStreamInput.java:70)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:105)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:565)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:793)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:458)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:439)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:565)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:793)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:565)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:471)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:332)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 18:09:23,835][WARN ][transport.netty ] [esn-05]
exception caught on netty layer [[id: 0xf49f4a4e, /10.8.2.50:58466 =>
/10.8.2.50:9300]]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:88)
at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:252)
at org.elasticsearch.common.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:157)
at org.elasticsearch.common.compress.lzf.LZFCompressedStreamOutput.compress(LZFCompressedStreamOutput.java:52)
at org.elasticsearch.common.compress.CompressedStreamOutput.flushBuffer(CompressedStreamOutput.java:125)
at org.elasticsearch.common.compress.CompressedStreamOutput.writeBytes(CompressedStreamOutput.java:80)
at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:252)
at org.elasticsearch.common.bytes.BytesArray.writeTo(BytesArray.java:83)
at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:94)
at org.elasticsearch.common.io.stream.AdapterStreamOutput.writeBytesReference(AdapterStreamOutput.java:98)
[2012-11-11 18:14:01,018][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121105][2], node[idI32JxCRKCzeQQ16ps0IA], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1f23a32f]

org.elasticsearch.transport.RemoteTransportException:
[esn-05][inet[/10.8.2.50:9300]][indices/stats/s]
Caused by: org.elasticsearch.index.IndexShardMissingException:
[messages_20121105][2] missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 21:55:23,328][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121022][3], node[JUmuWHmITJyyOiCbqTmnjQ], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@2a1ebae5]
org.elasticsearch.index.IndexShardMissingException: [messages_20121022][3]
missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 21:55:33,328][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121022][3], node[JUmuWHmITJyyOiCbqTmnjQ], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@245a7bd6]
org.elasticsearch.index.IndexShardMissingException: [messages_20121022][3]
missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 21:55:43,357][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121022][3], node[JUmuWHmITJyyOiCbqTmnjQ], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5dfce18]
org.elasticsearch.index.IndexShardMissingException: [messages_20121022][3]
missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 21:55:43,388][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121022][3], node[JUmuWHmITJyyOiCbqTmnjQ], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@28538cab]
org.elasticsearch.index.IndexShardMissingException: [messages_20121022][3]
missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-11-11 21:55:53,328][DEBUG][action.admin.indices.stats] [esn-03]
[messages_20121022][3], node[JUmuWHmITJyyOiCbqTmnjQ], [P], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5081b244]
org.elasticsearch.index.IndexShardMissingException: [messages_20121022][3]
missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:145)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)

--

IMHO your settings look good, and there is no indication that OOMs should
be expected. Just a few wild guesses: since you use compression (the stack
traces show the LZF codec at work), the JVM needs a lot of NIO memory,
which lives outside the heap, so a 12 GB heap seems a little high. You
could check whether switching off compression or lowering the heap to
4-8 GB changes the situation.
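
If you want to experiment with that, the knobs I have in mind are roughly
the following (please double-check the setting names against the 0.19
documentation; the heap line assumes the standard service wrapper config):

# elasticsearch.yml
transport.tcp.compress: false        # transport-level message compression
index.store.compress.stored: false   # stored-field compression on newly created indexes

# wrapper.conf (service wrapper), heap size in MB
set.default.ES_HEAP_SIZE=8192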

  • Are there JVM/kernel settings that could help prevent OOMEs such as
    this by perhaps being more aggressive at garbage collection?

Aggressive GC does not prevent OOMs; it only delays them. You are also on
a Java 6 JVM, where the choice of garbage collectors is rather limited. On
a Java 7 JVM you could, if you are interested, try the new G1 garbage
collector, which is slower but more predictable and has some advantages on
heaps larger than 8 GB.
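
As a rough sketch, that would mean dropping the ParNew/CMS lines from your
wrapper config and adding something like the following (keep the
wrapper.java.additional.* numbering consecutive; the pause target is only
a starting value to tune):

wrapper.java.additional.4=-XX:+UseG1GC
wrapper.java.additional.5=-XX:MaxGCPauseMillis=200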

  • Are there index or cluster settings that would help prevent corruption
    of shards in this situation?

After an OOM the JVM gets flaky: buffers are no longer guaranteed to be
flushed completely to persistent storage. The best you can do to minimize
corruption is to stop indexing immediately and restart the JVM.
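
If a node still responds after the OOM, it may also help to flush the
translog to disk and check the cluster state before restarting, e.g.:

curl -XPOST 'http://localhost:9200/_flush'
curl 'http://localhost:9200/_cluster/health?pretty=true'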

  • Is there any way to reduce the overhead of rolling to a new index?

Not sure how to answer this without knowing more details about the rolling
procedure... What is the overhead? Do you copy the data, or do you create
new weekly indices and use aliases?
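
If you do not use aliases yet: the usual pattern is a fixed write alias
that is switched over to the new weekly index in a single call, roughly
like this (index and alias names are only examples):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "messages_20121105", "alias": "messages_write" } },
    { "add":    { "index": "messages_20121112", "alias": "messages_write" } }
  ]
}'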

Best regards,

Jörg

--

Hi Jörg,

Regarding disabling compression: I went back and checked, and it looks
like we are not compressing the source documents and are otherwise using
default settings. My understanding is that index.store.compress.stored is
false by default. Is there somewhere I can check the compression settings?
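
For what it's worth, this is how I have been looking at the live index
settings so far:

curl 'http://localhost:9200/messages_20121105/_settings?pretty=true'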

We just moved to Java 7 and are experimenting with different JVM GC
settings, so I will let you know how that goes. We will also perhaps try
lowering the heap if turning off compression doesn't help. When we roll
indexes we simply start writing to a new index with a new date in the
name. We use templates to control the settings on the new index; we are
not using aliases, though that might be something to consider.

-drew

--