Beats Pipeline blocked on the hour

Every hour the pipeline gets blocked and I get the error below from Logstash. I have a simple Logstash config file with an input { beats { ... } } block, no filter, and a direct output { ... } to Elasticsearch. Elasticsearch shows no errors.
A couple of servers are sending topbeat and metricbeat to it. That's all.
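The config is roughly like this (the port and hosts below are placeholders, not my actual values):

    input {
      beats {
        port => 5044                      # placeholder port
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]       # placeholder host list
      }
    }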

{:timestamp=>"2016-11-10T20:07:39.465000+0000", :message=>"Beats input: unhandled exception", :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2996:in `sysread'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/lumberjack/beats/server.rb:463:in `read_socket'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/lumberjack/beats/server.rb:443:in `run'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats_support/connection_handler.rb:34:in `accept'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:211:in `handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats_support/circuit_breaker.rb:42:in `execute'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:211:in `handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:167:in `run'"], :level=>:error}

Logstash version 2.3.4.

Are you creating hourly indices?

I am. And I am missing topbeat events at the beginning of the hour.
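The hourly index name comes from the date pattern in the elasticsearch output, roughly like this (the index prefix is a placeholder):

    output {
      elasticsearch {
        hosts => ["localhost:9200"]            # placeholder
        index => "metrics-%{+YYYY.MM.dd.HH}"   # one index per hour
      }
    }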

Tell us more about your ES cluster.
How many nodes, shards, indices, data volume, version etc.

ES 8 servers.
8 * 2 = 16 shards per index.
4 Logstash servers.

The data being shipped is just metricbeat and topbeat, sent every minute from ~50 servers.

ES 2.3.5
LS 2.3.4
JVM: 1.8.0_101

The per-hour index size is ~70MB. I have put various filters in place to stop metricbeat from sending data for processes, file systems, etc. that are not useful. Topbeat data is trimmed at Logstash to eliminate all zero-valued items.
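The trimming is done with a ruby filter roughly like this (a sketch; the type names are illustrative and it only looks at top-level fields):

    filter {
      if [type] == "system" or [type] == "process" {
        ruby {
          # Remove numeric fields whose value is 0 (top-level fields only;
          # nested fields would need a recursive walk).
          code => "
            zero_fields = event.to_hash.select { |k, v| v.is_a?(Numeric) && v == 0 }.keys
            zero_fields.each { |k| event.remove(k) }
          "
        }
      }
    }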

If I convert it to a daily index, I get this error at 00:00 every night. Making it an hourly index gives me the error every hour.

With such a small amount of data, why are you using so many shards and hourly indices? Having a large number of very small indices/shards is quite inefficient as each shard comes with some overhead and increases the size of the cluster state.

How many indices/shards do you have in the cluster?

I plan to grow this significantly. Most queries are over the last 1-3 hours, so I am trying to reduce the footprint by not keeping a whole day's data in memory.

Currently 1,453 indices and 23,234 shards.
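For reference, I get those counts from the _cat APIs (assuming ES on localhost:9200):

    curl -s 'localhost:9200/_cat/indices' | wc -l
    curl -s 'localhost:9200/_cat/shards'  | wc -l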

I can see shards taking over 10s to start whenever I create a new index, and that must be why the beats are blocking the pipeline.

insertOrder timeInQueue priority source                                                                                                                               
     189769         12s URGENT   shard-started ([xml-2016.10.24][2], node[UDA77RbjSkOGvoe2s1UUGg], [P], v[1], s[INITIALIZING], a[id=kGx8PqApTkSnO29G86p5vQ], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189770         12s URGENT   shard-started ([xml-2016.10.24][3], node[JmKDOTcBSdGpbloL7hKkbQ], [P], v[1], s[INITIALIZING], a[id=5NoOfOroQlS3Mw8IbA1IHw], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189771         12s URGENT   shard-started ([xml-2016.10.24][1], node[QIYSLuh-QYqdPUwE8tlfWQ], [P], v[1], s[INITIALIZING], a[id=LGs5dhnaTH-750vdRl6V4A], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189772         12s URGENT   shard-started ([xml-2016.10.24][7], node[kXGnVW6ATNSy0_ZuveOGfA], [P], v[1], s[INITIALIZING], a[id=84lBcp7OQRKGpCTqmyxXug], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
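That output is from the pending tasks cat API (again assuming localhost:9200):

    curl 'localhost:9200/_cat/pending_tasks?v'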

Elasticsearch does not keep all the recent data in memory, so I do not see an issue with having indices cover longer periods. In order to get the most out of your cluster and be able to support a long retention period, it is generally recommended that you keep the average shard size between a few GB and a few tens of GB. The exact size depends on the use case. Given that you only have 70MB of indexed data per hour (less than 2GB per day), you should probably even consider using monthly indices. That would be about 50GB per month, and if you use 8 shards you get around 6GB per shard. This should speed up querying as well as cluster state updates.
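As a sketch, that could be an index template capping the shard count (the template name, index pattern and replica count here are just for illustration, using the ES 2.x template syntax):

    curl -XPUT 'localhost:9200/_template/metrics' -d '{
      "template": "metrics-*",
      "settings": {
        "number_of_shards": 8,
        "number_of_replicas": 1
      }
    }'

combined with a monthly date pattern in the Logstash elasticsearch output, e.g. index => "metrics-%{+YYYY.MM}".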

Thanks. I have just started with 50 servers to test the infrastructure, which is producing 70MB per hour. It will grow to over 2,000 servers and should produce around 3GB per hour. That's why I am going for hourly indices.

Is there a way to add some tolerance for the delay while a new index is created, so that the error does not come up? Is there any metricbeat or logstash parameter I can use to wait for ES to be ready with the new index?
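The only knob I have found so far is the congestion_threshold setting on the beats input, which (if I read the docs right) controls how many seconds the input waits before the circuit breaker in the backtrace above trips. Something like this, though it presumably only masks the delay rather than fixing it:

    input {
      beats {
        port => 5044                  # placeholder
        congestion_threshold => 30    # seconds; the default is 5 if I remember correctly
      }
    }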

Even at 3GB per hour it is only around 72GB per day, and at that volume there is no reason to go to hourly indices. If you aim for an average shard size of 10GB and assume the size on disk is roughly the same as the raw input volume (this will depend on the level of enrichment and the mappings used, so it could be higher or lower), you still only need around 8 primary shards per day. Using hourly indices will, as you have seen, result in a lot of shards, which will cause you a lot more problems down the line.
