Timeout Errors when creating lots of new Indices?

Hello, when I have an Elasticsearch cluster containing nothing but the .kibana and .marvel indices, my consumer reading in the files will periodically encounter exceptions and close, because ES is closing the connection:

17-10-20 17:15:33 epgidvledw1044 ERROR [signafire.packrat:23] - Uncaught exception on async-thread-macro-2
                                                java.lang.Thread.run                 Thread.java:  745
                  java.util.concurrent.ThreadPoolExecutor$Worker.run     ThreadPoolExecutor.java:  622
                   java.util.concurrent.ThreadPoolExecutor.runWorker     ThreadPoolExecutor.java: 1152
                                   clojure.core.async/thread-call/fn                   async.clj:  439
                       signafire.packrat.components.rodent.Rodent/fn                  rodent.clj:   97
                    signafire.packrat.components.rodent.Rodent/fn/fn                  rodent.clj:  131
                                                  clojure.core/dorun                    core.clj: 3009
                                                    clojure.core/seq                    core.clj:  137
                                                 clojure.core/map/fn                    core.clj: 2629
                 signafire.packrat.components.rodent.Rodent/fn/fn/fn                  rodent.clj:  136
              signafire.packrat.components.rabbitmq/publish-failure!                rabbitmq.clj:   78
                                               langohr.queue/declare                   queue.clj:   75
com.rabbitmq.client.impl.recovery.AutorecoveringChannel.queueDeclare  AutorecoveringChannel.java:  266
                      com.rabbitmq.client.impl.ChannelN.queueDeclare               ChannelN.java:  844
                  com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc             AMQChannel.java:  118
                      com.rabbitmq.client.impl.AMQChannel.privateRpc             AMQChannel.java:  219
                             com.rabbitmq.client.impl.AMQChannel.rpc             AMQChannel.java:  242
                    com.rabbitmq.client.impl.AMQChannel.quiescingRpc             AMQChannel.java:  251
               com.rabbitmq.client.impl.AMQChannel.quiescingTransmit             AMQChannel.java:  316
               com.rabbitmq.client.impl.AMQChannel.quiescingTransmit             AMQChannel.java:  334
                        com.rabbitmq.client.impl.AMQCommand.transmit             AMQCommand.java:  125
                        com.rabbitmq.client.impl.AMQConnection.flush          AMQConnection.java:  518
                   com.rabbitmq.client.impl.SocketFrameHandler.flush     SocketFrameHandler.java:  150
                                      java.io.DataOutputStream.flush       DataOutputStream.java:  123
                                  java.io.BufferedOutputStream.flush   BufferedOutputStream.java:  140
                            java.io.BufferedOutputStream.flushBuffer   BufferedOutputStream.java:   82
                                   java.net.SocketOutputStream.write     SocketOutputStream.java:  161
                             java.net.SocketOutputStream.socketWrite     SocketOutputStream.java:  115
                            java.net.SocketOutputStream.socketWrite0      SocketOutputStream.java     
java.net.SocketException: Broken pipe (Write failed)
         java.lang.Error: java.net.SocketException: Broken pipe (Write failed)

Packrat is the consumer I have that reads documents off RabbitMQ and indexes them into Elasticsearch. When running this program under supervision, it restarts and grinds through this process until the maximum number of indices needed has been created, and then everything runs well. I've found some other issues that seem related, but I'd like to note that I'm only creating one index for each month in a year, and I'm only allocating 5 shards per index for dates before 2000 and 25 for dates after 2000. I've attached the seemingly related issues below:
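The shard-allocation policy described above can be sketched as a small helper. This is a minimal illustration of the stated rule (5 primaries for pre-2000 monthly indices, 25 from 2000 onward); the function name is hypothetical, not part of Packrat:

```python
from datetime import date

def shards_for(doc_date: date) -> int:
    """Illustrative version of the policy described above:
    5 primary shards for pre-2000 months, 25 for 2000 onward."""
    return 5 if doc_date.year < 2000 else 25

assert shards_for(date(1998, 3, 1)) == 5
assert shards_for(date(2017, 10, 24)) == 25
```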

The ES cluster is 16 nodes: 1 dedicated master, 2 dedicated client nodes, and the rest data nodes. The servers they run on all have 8-core Intel Xeon E5 CPUs and 64 GB of RAM.


What do the Elasticsearch logs show at this time? How do you know it's related to creating a lot of new indices? How many indices are you creating, and why?

Why so many for the newer data?

That's bad, see https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes

Hi Mark,

To answer your questions:
[2017-10-24 17:15:57,020][ERROR][marvel.agent ] [this.is.the.server.name] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:748)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2017.10.24], type [node_stats], id [AV9QPQR2OX3ZaVlWAb_2], message [UnavailableShardsException[[.marvel-es-1-2017.10.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.marvel-es-1-2017.10.24] containing [1] requests]]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2017.10.24], type [node_stats], id [AV9QPQR2OX3ZaVlWAb_2], message [UnavailableShardsException[[.marvel-es-1-2017.10.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [.marvel-es-1-2017.10.24] containing [1] requests]]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:118)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more

Also, there are many more shards for dates after 2000 because that's where the lion's share of the data is. Indices get created according to the date the document was generated; an example index name would be sf_ab_doc__1998_03__v2.1. Right now, in a testing ES cluster, we have about 606 indices with a few million documents and about 6,500 shards.
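For concreteness, the naming scheme quoted above and a rough per-node load check can be sketched like this (the `index_name` helper is hypothetical, just reproducing the format of the example name):

```python
def index_name(year: int, month: int,
               prefix: str = "sf_ab_doc", version: str = "v2.1") -> str:
    # Monthly index name in the format quoted above
    return f"{prefix}__{year}_{month:02d}__{version}"

assert index_name(1998, 3) == "sf_ab_doc__1998_03__v2.1"

# Rough load check: ~6,500 shards spread over 16 nodes is roughly
# 400+ shards per node, which is a lot for a cluster of this size.
shards_per_node = 6500 // 16
assert shards_per_node == 406
```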

That's way too many for such a small dataset.

Oh okay, I didn't realize that. How many indices/shards do you think I should have? Also, do you think changing the way we index would help fix this problem?

Reducing the shard count will likely solve the problem.

So I knocked the shards down to 5 per index, and that solved the problem! Thanks! I do have a few follow up questions though:
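One way to pin new monthly indices to 5 primaries is an index template. The sketch below is just the request body you would PUT to the ES 2.x template endpoint (`PUT /_template/<name>`); the template name and replica count are assumptions, not something confirmed in this thread:

```python
import json

# Hypothetical index template body: any new index matching the
# sf_ab_doc__* pattern gets 5 primary shards (and 1 replica here).
template = {
    "template": "sf_ab_doc__*",
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
    },
}

body = json.dumps(template)
assert json.loads(body)["settings"]["number_of_shards"] == 5
```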

  1. Why is it creating shards, rather than indices, that caused ES to time out?
  2. Adding documents to ES is still relatively slow, at around 30 documents a second. What else can I do to speed this process up?
  1. The indices are stored in the cluster state. The shards are stored in a routing table, which is also part of the cluster state. Every time a shard changes state (when created, started, recovered, initialised, or moved) it needs to update the routing table and thus the cluster state. In 2.x the entire cluster state was updated and then sent to all other nodes in the cluster, whereas in 5.x we send just the changes that are made, which is far more efficient.
  2. Check hot_threads?

Cluster state updates are generally done single-threaded in order to ensure consistency. The changes are then propagated to all the nodes in the cluster. In addition to adding or altering indices and changing the location and state of shards, changes to mappings also require the cluster state to be updated. So if you are using dynamic mappings and have frequent changes/additions, this results in an even larger number of cluster state changes.
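If dynamic-mapping churn is a concern, one mitigation is to declare the mapping explicitly and set `dynamic` to `strict`, so new document fields cannot trigger mapping (and therefore cluster state) updates. The sketch below is just the request body for an ES 2.x create-index call; the type and field names are illustrative assumptions:

```python
import json

# Hypothetical explicit mapping with dynamic field creation disabled:
# unknown fields are rejected instead of updating the cluster state.
mapping = {
    "mappings": {
        "doc": {                       # illustrative type name (ES 2.x)
            "dynamic": "strict",
            "properties": {
                "body": {"type": "string"},  # "string" is the 2.x field type
            },
        }
    }
}

body = json.dumps(mapping)
assert json.loads(body)["mappings"]["doc"]["dynamic"] == "strict"
```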

Thanks for the explanation guys! Really helpful insights. I see an update of ES in my future.....

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.