RemoteTransportException - AlreadyClosedException[this IndexReader is closed]

This looks like an error on the internal transport, but I'm not sure why it's happening. The cluster is green and otherwise happy, ingesting 600-700 GB a day across 20 data nodes (20 shards per index). We're also having some issues with this 2.3.1 cluster not indexing all documents compared to an old 1.7.2 cluster (both receiving data from the same Heka instances), but I'm not sure that's related to this exception.

No indexes are closed, so I assume this is some kind of timeout in the internal communication? Is this a problem?

[2016-05-09 18:20:15,887][DEBUG][action.admin.cluster.node.stats] [ip-10-10-10-10] failed to execute on node [huiodfhg78ytdfghg89]
RemoteTransportException[[ip-10-10-10-10][][cluster:monitor/nodes/stats[n]]]; nested: AlreadyClosedException[this IndexReader is closed];
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.apache.lucene.index.IndexReader.ensureOpen(
at org.apache.lucene.index.CompositeReader.getContext(
at org.apache.lucene.index.CompositeReader.getContext(
at org.apache.lucene.index.IndexReader.leaves(
at org.elasticsearch.index.shard.IndexShard.completionStats(
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(
at org.elasticsearch.indices.IndicesService.stats(
at org.elasticsearch.node.service.NodeService.stats(
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(
at org.elasticsearch.transport.TransportService$4.doRun(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$

Yep, unfortunately this was a bug introduced in 2.3.0. It was just fixed in 2.3.2 (unreleased as of writing) so should be available soonish:

It's unclear to me if this is actually related to the problems you're having, it may just be coincidental. Can you describe your problem a bit more?

Also, are you checking that A) there are no bulk rejections and B) if there are rejections, you're retrying the rejected documents? A rejection isn't really an error, it's just backpressure and the cluster saying "please try again later". So if you aren't retrying the rejected docs, they will be silently dropped on the floor by your app (or Heka or whatever) and never get indexed.
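To make the retry point concrete, here's a minimal sketch of the idea in Python. It is not Heka's actual retry logic; `send_bulk`, the status-list return shape, and the parameters are illustrative stand-ins for whatever bulk client you use. The one real detail is that Elasticsearch reports rejections per item in the bulk response (HTTP 429), so you have to inspect each item and resend only the rejected ones:

```python
import time

REJECTED = 429  # per-item status Elasticsearch returns for a bulk rejection

def bulk_with_retry(send_bulk, docs, max_retries=3, backoff=0.5):
    """Index docs via send_bulk, resending only rejected items.

    send_bulk(docs) is assumed to return a list of per-item status
    codes in the same order as docs (mirroring the bulk response's
    `items` array). Anything still rejected after max_retries is
    returned to the caller instead of being silently dropped.
    """
    pending = list(docs)
    for attempt in range(max_retries + 1):
        statuses = send_bulk(pending)
        # Keep only the items the cluster pushed back on.
        pending = [d for d, s in zip(pending, statuses) if s == REJECTED]
        if not pending:
            return []  # everything made it in
        if attempt < max_retries:
            # Back off before retrying: rejection is backpressure,
            # so hammering the cluster immediately just re-rejects.
            time.sleep(backoff * (2 ** attempt))
    return pending  # caller's responsibility: log, re-queue, or alert
```

The key design point is the return value: a shipper that treats a 200 bulk response as "all done" without checking the per-item statuses is exactly how documents get dropped on the floor.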

Thanks for the link to the issue; it didn't show up when I was searching. That would explain the exception. I take it it's harmless(-ish) unless you're running mmapfs, so we're good there.

As for the ingestion mismatch, I can't blame it on this exception, just wanted to understand this one since it was the only thing that jumped out in the logs. It definitely looks like some sort of backpressure since the rate drops off a bit and then shoots higher to catch up. Thanks!

Cool, happy to help! And yeah, that exception should be harmless unless you fall into the mmap edge case (in which case it's quite gruesome) :slight_smile:

I'm running 2.3.2 and am seeing lots of these errors.

It was not fixed in 2.3.2, but in 2.3.3.