Cluster will not leave 'red' state no matter what I do

I have an Elasticsearch cluster that will not leave a red state no matter what I try. Cluster health currently looks like this:

    "cluster_name" : "es-logstash",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 13,
      "number_of_data_nodes" : 2,
      "active_primary_shards" : 3739,
      "active_shards" : 7189,
      "relocating_shards" : 0,
      "initializing_shards" : 6,
      "unassigned_shards" : 283,
      "number_of_pending_tasks" : 102

Here is what I have tried so far:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'


curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands": [{
        "allocate": {
            "index": "logstash-syslog-events-2015.06.26",
            "shard": 3,
            "node": "es-logstash-n1",
            "allow_primary": 1
        }
    }]
}'

I have also somehow ended up with index shards assigned to strange node identifiers:

{"state":"STARTED","primary":true,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":0,"index":"logstash-windows-events2015.04.23"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":3,"index":"logstash-windows-events2015.04.23"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":1,"index":"logstash-windows-events2015.04.23"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":2,"index":"logstash-windows-events2015.04.23"}
{"state":"STARTED","primary":true,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":0,"index":".marvel-2015.03.09"}
{"state":"STARTED","primary":true,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":4,"index":"logstash-windows-events2015.04.20"}
{"state":"STARTED","primary":true,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":0,"index":"logstash-windows-events2015.04.20"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":3,"index":"logstash-windows-events2015.04.20"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":1,"index":"logstash-windows-events2015.04.20"}
{"state":"STARTED","primary":false,"node":"0WjU5UtwSrS0bqnJR6Vqaw","relocating_node":null,"shard":2,"index":"logstash-windows-events2015.04.20"}

Is it possible to take everything that is unassigned and 'flush' its status so that it reallocates between the two nodes in my cluster? Here is a sample of the unassigned shards:

logstash-mail-events-2015.05.15     0 r UNASSIGNED
logstash-mail-events-2015.05.15     3 r UNASSIGNED
logstash-mail-events-2015.05.15     1 r UNASSIGNED
logstash-mail-events-2015.05.15     2 r UNASSIGNED
logstash-mail-events-2015.05.18     2 r UNASSIGNED
logstash-mail-events-2015.05.18     0 r UNASSIGNED
logstash-mail-events-2015.05.18     3 r UNASSIGNED
logstash-mail-events-2015.05.18     1 r UNASSIGNED
logstash-mail-events-2015.05.18     4 r UNASSIGNED
logstash-mail-events-2015.05.16     4 r UNASSIGNED

Hello @Don_Pich, I had the same problem in the past and fixed it using the reroute API. Check how to use it here:
https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html
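
In short, the approach there is to list the UNASSIGNED shards with _cat/shards and issue an allocate reroute for each one onto a node of your choice. A rough sketch of that loop (the node name es-logstash-n1 is just a placeholder, adjust for your cluster):

    # reroute every UNASSIGNED shard onto a named node (replicas only as written;
    # adding "allow_primary": true would force empty primaries and lose their data)
    curl -s 'localhost:9200/_cat/shards' | grep UNASSIGNED | while read index shard prirep state rest; do
      curl -XPOST 'localhost:9200/_cluster/reroute' -d "{
        \"commands\": [{
          \"allocate\": {
            \"index\": \"$index\",
            \"shard\": $shard,
            \"node\": \"es-logstash-n1\"
          }
        }]
      }"
    done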

hope it helps!

Read carefully (or post) the response from your reroute command. I've had a similar problem where reroute throws an error that lists a bunch of allocation decisions, something like 'YES: rule 1; YES: rule 2; NO: rule 3; YES: rule 4', etc. Usually that will give you somewhere to start.
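
You can also add the explain flag to the reroute request itself, which returns the allocation deciders' YES/NO verdicts for each command even when the call doesn't fail outright, e.g.:

    curl -XPOST 'localhost:9200/_cluster/reroute?explain=true&pretty' -d '{
        "commands": [{
            "allocate": {
                "index": "logstash-syslog-events-2015.06.26",
                "shard": 3,
                "node": "es-logstash-n1"
            }
        }]
    }'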

How did your cluster get into that state?
Have you checked your ES logs (on the master node) to see what is happening?

So I have dug into this further.

The server itself has 32 GB of physical RAM. My heap size is currently set to 12 GB.
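
(For reference, the heap is set via ES_HEAP_SIZE; with the stock Debian/RPM packages that lives in /etc/default/elasticsearch or /etc/sysconfig/elasticsearch:)

    # /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
    ES_HEAP_SIZE=12g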

When I look at the system's utilization, it looks like all of the RAM is being consumed by Elasticsearch. The excerpt below is with Logstash not running.

top - 15:12:19 up  6:33,  3 users,  load average: 1.18, 1.27, 1.27
Tasks: 122 total,   1 running, 121 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.7 us,  2.1 sy,  0.0 ni, 79.0 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  33021480 total, 32760500 used,   260980 free,   379268 buffers
KiB Swap:  9783292 total,    13048 used,  9770244 free, 18322156 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
25635 elastics  20   0 65.5g  12g 267m S 166.8 40.4 164:25.72 java
 2539 root      20   0 98.8m 3992 3664 S   0.3  0.0   0:20.59 vmtoolsd
    1 root      20   0 10648  708  680 S   0.0  0.0   0:01.58 init
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S   0.0  0.0   0:00.85 ksoftirqd/0
    4 root      20   0     0    0    0 S   0.0  0.0   0:01.38 kworker/0:0
    5 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/u:0
    6 root      rt   0     0    0    0 S   0.0  0.0   0:00.33 migration/0

So other than the heap size, how do I limit the virtual memory size, or is that even my problem in the first place?

That is controlled by the OS, but it's unlikely to be the cause.

Did you look in the logs as I mentioned?

My elasticsearch logs are busted (i.e. not being generated). I am getting nothing from them.

I will need to analyze and set them up before I can proceed on this.

Alright, I have some logs being kicked out of the server now. I have ignored the 'info' messages.

I honestly don't know what to look for in these, but this is what shows up after restarting the nodes in the cluster.

[2015-05-21 14:02:26,819][ERROR][marvel.agent.exporter    ] [es-logstash-n1] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.05.21/_bulk]: SocketTimeoutException[Read timed out]
[2015-05-21 14:02:27,073][DEBUG][action.bulk              ] [es-logstash-n1] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-05-21 14:03:14,461][DEBUG][action.bulk              ] [es-logstash-n1] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-05-21 14:03:37,182][DEBUG][action.bulk              ] [es-logstash-n1] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-05-21 14:03:37,185][ERROR][marvel.agent.exporter    ] [es-logstash-n1] create failure (index:[.marvel-2015.05.21] type: [node_stats]): UnavailableShardsException[[.marvel-2015.05.21][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@61ddb852]
[2015-05-21 14:04:16,914][DEBUG][action.bulk              ] [es-logstash-n1] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-05-21 14:04:47,826][DEBUG][action.bulk              ] [es-logstash-n1] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-05-21 14:04:47,827][ERROR][marvel.agent.exporter    ] [es-logstash-n1] create failure (index:[.marvel-2015.05.21] type: [node_stats]): UnavailableShardsException[[.marvel-2015.05.21][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@7f116a62]
[2015-05-21 14:05:28,490][DEBUG][action.bulk              ] [es-logstash-n1] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-05-21 14:05:33,802][WARN ][http.netty               ] [es-logstash-n1] Caught exception while handling client http traffic, closing connection [id: 0x6f86bafd, /192.168.1.80:44717 => /192.168.1.72:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes.
        at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:169)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:135)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-05-21 14:05:35,435][WARN ][http.netty               ] [es-logstash-n1] Caught exception while handling client http traffic, closing connection [id: 0x2302f3f5, /192.168.1.80:44723 => /192.168.1.72:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes.
        at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:169)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:135)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-05-21 14:23:39,891][WARN ][action.bulk              ] [es-logstash-n1] Failed to perform indices:data/write/bulk[s] on remote replica [es-logstash-n2][hjPa7QykSSyj_ezX0Y15CA][logstash][inet[/192.168.1.80:9300]][logstash-syslog-events-2015.05.21][4]
org.elasticsearch.transport.RemoteTransportException: [es-logstash-n2][inet[/192.168.1.80:9300]][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.index.engine.CreateFailedEngineException: [logstash-syslog-events-2015.05.21][4] Create failed for [logs#9XCHh4pFRC-cn1tfEzGkjg]
        at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:262)
        at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:470)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:583)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:249)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:228)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /elasticsearch/es-logstash/nodes/0/indices/logstash-syslog-events-2015.05.21/4/index/_3t.fdt (Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:384)
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:277)
        at org.apache.lucene.store.FileSwitchDirectory.createOutput(FileSwitchDirectory.java:152)
        at org.apache.lucene.store.RateLimitedFSDirectory.createOutput(RateLimitedFSDirectory.java:40)
        at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:69)
        at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:69)
        at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:44)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:113)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:120)
        at org.apache.lucene.index.DefaultIndexingChain.initStoredFieldsWriter(DefaultIndexingChain.java:83)
        at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:270)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:314)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
        at org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:343)
        at org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:285)
        at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:256)
        ... 9 more
[2015-05-21 14:23:47,853][WARN ][indices.cluster          ] [es-logstash-n1] [[logstash-syslog-events-2015.03.16][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-syslog-events-2015.03.16][0]: Recovery failed from [es-logstash-n2][hjPa7QykSSyj_ezX0Y15CA][logstash][inet[/192.168.1.80:9300]] into [es-logstash-n1][yp__I9bqSk67NMMXDp_9Pw][logstash][inet[logstash.realtruck.com/192.168.1.72:9300]]
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [es-logstash-n2][inet[/192.168.1.80:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [logstash-syslog-events-2015.03.16][0] Phase[1] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:842)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [logstash-syslog-events-2015.03.16][0] Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:837)
        ... 10 more
Caused by: java.nio.file.FileSystemException: /elasticsearch/es-logstash/nodes/0/indices/logstash-syslog-events-2015.03.16/0/index/_bww.si: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:334)
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
        at org.apache.lucene.store.FileSwitchDirectory.openInput(FileSwitchDirectory.java:172)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:683)
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
        at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:51)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
        at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:110)
        at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:158)
        at org.elasticsearch.index.store.Store.access$400(Store.java:88)
        at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:781)
        at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:768)
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:222)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:159)
        ... 11 more
[2015-05-21 14:23:57,657][WARN ][indices.cluster          ] [es-logstash-n1] [[logstash-syslog-events-2015.08.19][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-syslog-events-2015.08.19][4]: Recovery failed from [es-logstash-n2][hjPa7QykSSyj_ezX0Y15CA][logstash][inet[/192.168.1.80:9300]] into [es-logstash-n1][yp__I9bqSk67NMMXDp_9Pw][logstash][inet[logstash.realtruck.com/192.168.1.72:9300]]
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [es-logstash-n2][inet[/192.168.1.80:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [logstash-syslog-events-2015.08.19][4] Phase[1] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:842)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [logstash-syslog-events-2015.08.19][4] Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:837)
        ... 10 more
Caused by: java.io.IOException: directory '/elasticsearch/es-logstash/nodes/0/indices/logstash-syslog-events-2015.08.19/4/index' exists and is a directory, but cannot be listed: list() returned null
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:226)
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:237)
        at org.elasticsearch.index.store.fs.DefaultFsDirectoryService$1.listAll(DefaultFsDirectoryService.java:57)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:532)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:219)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:159)
        ... 11 more
[2015-05-21 14:24:00,948][WARN ][indices.cluster          ] [es-logstash-n1] [[logstash-syslog-events-2015.11.18][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-syslog-events-2015.11.18][0]: Recovery failed from [es-logstash-n2][hjPa7QykSSyj_ezX0Y15CA][logstash][inet[/192.168.1.80:9300]] into [es-logstash-n1][yp__I9bqSk67NMMXDp_9Pw][logstash][inet[logstash.realtruck.com/192.168.1.72:9300]]
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [es-logstash-n2][inet[/192.168.1.80:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [logstash-syslog-events-2015.11.18][0] Phase[1] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:842)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [logstash-syslog-events-2015.11.18][0] Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:837)
        ... 10 more
Caused by: java.io.IOException: directory '/elasticsearch/es-logstash/nodes/0/indices/logstash-syslog-events-2015.11.18/0/index' exists and is a directory, but cannot be listed: list() returned null
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:226)
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:237)
        at org.elasticsearch.index.store.fs.DefaultFsDirectoryService$1.listAll(DefaultFsDirectoryService.java:57)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:532)
        at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
        at org.elasticsearch.index.store.Store.getMetadata(Store.java:219)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:159)
        ... 11 more
[2015-05-21 14:24:21,244][WARN ][indices.cluster          ] [es-logstash-n1] [[logstash-syslog-events-2015.09.02][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-syslog-events-2015.09.02][3]: Recovery failed from [es-logstash-n2][hjPa7QykSSyj_ezX0Y15CA][logstash][inet[/192.168.1.80:9300]] into [es-logstash-n1][yp__I9bqSk67NMMXDp_9Pw][logstash][inet[logstash.realtruck.com/192.168.1.72:9300]]
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [es-logstash-n2][inet[/192.168.1.80:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [logstash-syslog-events-2015.09.02][3] Phase[1] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:842)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [logstash-syslog-events-2015.09.02][3] Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:837)
        ... 10 more
Caused by: java.nio.file.FileSystemException: /elasticsearch/es-logstash/nodes/0/indices/logstash-syslog-events-2015.09.02/3/index/_0.si: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:334)
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
        at org.apache.lucene.store.FileSwitchDirectory.openInput(FileSwitchDirectory.java:172)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
        at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:683)
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
        at

Sounds like you need to increase your ulimit.
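
The "Too many open files" errors in the log point the same way. With the packaged install, the usual ways to raise the limit are /etc/security/limits.conf for the user running Elasticsearch, or MAX_OPEN_FILES in the package defaults file; a sketch (adjust the user name and paths to your install, and restart the node afterwards):

    # /etc/security/limits.conf -- raise the open-files limit for the ES user
    elasticsearch  soft  nofile  65535
    elasticsearch  hard  nofile  65535

    # or, for the packaged init script, in /etc/default/elasticsearch (Debian)
    # or /etc/sysconfig/elasticsearch (RPM):
    # MAX_OPEN_FILES=65535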

Here is what my system is set to. It looks like I have plenty of room, but I don't fully understand these values. What is recommended for Elasticsearch?

root@logstash:~# cat /proc/sys/fs/file-max
3300878
root@logstash:~# ulimit -Hn
4096
root@logstash:~# ulimit -Sn
1024
root@logstash:~#
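
Note that the ulimit values in a fresh root shell don't necessarily match what the running Elasticsearch process was started with by its init script. The live process's limit can be checked directly (PID 25635 taken from the top output above), or via the nodes API as in the output below:

    cat /proc/25635/limits | grep -i 'open files'
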
root@logstash:~# curl -XGET 'http://localhost:9200/_nodes?os=true&process=true&pretty=true'
{
  "cluster_name" : "es-logstash",
  "nodes" : {
    "HYcEKbYWSyyDXyq1BjxiHw" : {
      "name" : "es-logstash-n1",
      "transport_address" : "inet[logstash.realtruck.com/192.168.1.72:9300]",
      "host" : "logstash",
      "ip" : "192.168.1.72",
      "version" : "1.5.2",
      "build" : "62ff986",
      "http_address" : "inet[/192.168.1.72:9200]",
      "settings" : {
        "indicies" : {
          "fielddata" : {
            "cache" : {
              "size" : "50%"
            }
          }
        },
        "node" : {
          "name" : "es-logstash-n1"
        },
        "bootstrap" : {
          "mlockall" : "true"
        },
        "client" : {
          "type" : "node"
        },
        "name" : "es-logstash-n1",
        "pidfile" : "/var/run/elasticsearch.pid",
        "path" : {
          "data" : "/elasticsearch",
          "work" : "/tmp/elasticsearch",
          "home" : "/usr/share/elasticsearch",
          "conf" : "/etc/elasticsearch",
          "logs" : "/var/log/elasticsearch"
        },
        "cluster" : {
          "name" : "es-logstash"
        },
        "config" : "/etc/elasticsearch/elasticsearch.yml",
        "discovery" : {
          "zen" : {
            "minimum_master_nodes" : "2",
            "ping" : {
              "unicast" : {
                "hosts" : [ "192.168.1.72", "192.168.1.80" ]
              },
              "multicast" : {
                "enabled" : "false"
              },
              "timeout" : "10s"
            }
          }
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 8,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2660,
          "total_cores" : 8,
          "total_sockets" : 2,
          "cores_per_socket" : 4,
          "cache_size_in_bytes" : 12288
        },
        "mem" : {
          "total_in_bytes" : 33813995520
        },
        "swap" : {
          "total_in_bytes" : 10018091008
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 17197,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      },
      "jvm" : {
        "pid" : 17197,
        "version" : "1.7.0_79",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.79-b02",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1432237140439,
        "mem" : {
          "heap_init_in_bytes" : 6442450944,
          "heap_max_in_bytes" : 6372720640,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 6372720640
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      },
      "thread_pool" : {
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 8,
          "max" : 8,
          "queue_size" : "200"
        },
        "get" : {
          "type" : "fixed",
          "min" : 8,
          "max" : 8,
          "queue_size" : "1k"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 8,
          "max" : 8,
          "queue_size" : "1k"
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 8,
          "max" : 8,
          "queue_size" : "50"
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 24,
          "max" : 24,
          "queue_size" : "1k"
        },
        "listener" : {
          "type" : "fixed",
          "min" : 4,
          "max" : 4,
          "queue_size" : -1
        },
        "percolate" : {
          "type" : "fixed",
          "min" : 8,
          "max" : 8,
          "queue_size" : "1k"
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 4,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "192.168.1.72",
          "name" : "eth0",
          "mac_address" : "00:50:56:99:D5:BB"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9300]",
        "publish_address" : "inet[logstash.realtruck.com/192.168.1.72:9300]",
        "profiles" : { }
      },
      "http" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9200]",
        "publish_address" : "inet[/192.168.1.72:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ {
        "name" : "marvel",
        "version" : "1.3.1",
        "description" : "Elasticsearch Management & Monitoring",
        "url" : "/_plugin/marvel/",
        "jvm" : true,
        "site" : true
      } ]
    }
  }
}