Beats Pipeline blocked on the hour

Omar_Al_Zabir · November 11, 2016, 8:03am

Every hour, the pipeline gets blocked and I get this error from Logstash. I have a simple Logstash config file with input { beats { ... } }, no filter, and a direct output { ... } to Elasticsearch. Elasticsearch shows no errors.
A couple of servers are sending topbeat and metricbeat data to it. That's all.

{:timestamp=>"2016-11-10T20:07:39.465000+0000", :message=>"Beats input: unhandled exception", :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>["org/jruby/RubyIO.java:2996:in `sysread'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/lumberjack/beats/server.rb:463:in `read_socket'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/lumberjack/beats/server.rb:443:in `run'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats_support/connection_handler.rb:34:in `accept'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:211:in `handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats_support/circuit_breaker.rb:42:in `execute'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:211:in `handle_new_connection'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-beats-2.2.9/lib/logstash/inputs/beats.rb:167:in `run'"], :level=>:error}

Logstash version 2.3.4.

warkolm · November 13, 2016, 7:56am

Are you creating hourly indices?

Omar_Al_Zabir · November 13, 2016, 9:56am

I am. And I am missing topbeat events at the beginning of the hour.

warkolm · November 13, 2016, 9:57am

Tell us more about your ES cluster.
How many nodes, shards, indices, data volume, version etc.

Omar_Al_Zabir · November 14, 2016, 4:07pm

ES 8 servers.
8 * 2 = 16 shards per index.
4 Logstash servers.

The data getting shipped is just metricbeat and topbeat sending every minute from ~50 servers.

ES 2.3.5
LS 2.3.4
JVM: 1.8.0_101

The per-hour index size is ~70MB. I have put various filters in place to stop metricbeat sending data for processes, file systems etc. that are not useful. Topbeat data is trimmed at Logstash to eliminate all zero-valued items.

If I convert it to a daily index, I get this error at 00:00 every night. Making it an hourly index gives me the error every hour.

Christian_Dahlqvist · November 14, 2016, 4:16pm

With such a small amount of data, why are you using so many shards and hourly indices? Having a large number of very small indices/shards is quite inefficient as each shard comes with some overhead and increases the size of the cluster state.

How many indices/shards do you have in the cluster?

Omar_Al_Zabir · November 14, 2016, 5:22pm

I plan to grow this significantly. Most of the queries cover the last 1-3 hours, so I am trying to reduce the footprint by not having the whole day's data in memory.

Currently 1,453 indices 23,234 shards.

I can see shards take over 10s to start whenever I try to create a new index and that must be why the beats are blocking the pipeline.

insertOrder timeInQueue priority source                                                                                                                               
     189769         12s URGENT   shard-started ([xml-2016.10.24][2], node[UDA77RbjSkOGvoe2s1UUGg], [P], v[1], s[INITIALIZING], a[id=kGx8PqApTkSnO29G86p5vQ], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189770         12s URGENT   shard-started ([xml-2016.10.24][3], node[JmKDOTcBSdGpbloL7hKkbQ], [P], v[1], s[INITIALIZING], a[id=5NoOfOroQlS3Mw8IbA1IHw], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189771         12s URGENT   shard-started ([xml-2016.10.24][1], node[QIYSLuh-QYqdPUwE8tlfWQ], [P], v[1], s[INITIALIZING], a[id=LGs5dhnaTH-750vdRl6V4A], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store] 
     189772         12s URGENT   shard-started ([xml-2016.10.24][7], node[kXGnVW6ATNSy0_ZuveOGfA], [P], v[1], s[INITIALIZING], a[id=84lBcp7OQRKGpCTqmyxXug], unassigned_info[[reason=INDEX_CREATED], at[2016-11-14T17:23:13.122Z]]), reason [after recovery from store]
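
For scale, the counts quoted in this post can be turned into per-index and per-node figures. The sketch below is only arithmetic on the numbers given in the thread; it assumes all 8 Elasticsearch servers hold data and that shards are spread evenly across them, neither of which the thread confirms.

```python
# Figures quoted in the thread.
TOTAL_INDICES = 1453
TOTAL_SHARDS = 23234
ES_NODES = 8  # assumption: every ES server is a data node

# Average number of shards (primaries plus replicas) per index.
shards_per_index = TOTAL_SHARDS / TOTAL_INDICES

# Average number of shards each node carries, assuming an even spread.
shards_per_node = TOTAL_SHARDS / ES_NODES

print(f"shards per index: {shards_per_index:.1f}")
print(f"shards per node:  {shards_per_node:.0f}")
```

The per-index average comes out at about 16, matching the "8 * 2 = 16 shards per index" stated earlier, and each node ends up carrying roughly 2,900 shards.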

Christian_Dahlqvist · November 14, 2016, 5:38pm

Elasticsearch does not keep all the recent data in memory, so I do not see an issue with having indices cover longer periods. In order to get the most out of your cluster and be able to support a long retention period, it is generally recommended that you keep the average shard size between a few GB and a few tens of GB. The exact size depends on the use case. Given that you only have 70MB of indexed data per hour (less than 2GB per day), you should probably even consider using monthly indices. That would be about 50GB per month, and if you use 8 shards you get around 7 GB per shard. This should speed up querying as well as cluster state updates.
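
The sizing argument above can be checked with a few lines of arithmetic. This is a rough sketch, not an official sizing formula: it assumes 8 primary shards per index, a 30-day month, 1 GB = 1000 MB, and that on-disk size equals the 70MB per hour quoted in the thread. The function name `shard_size_mb` is made up for illustration.

```python
MB_PER_HOUR = 70        # indexed volume quoted in the thread
PRIMARY_SHARDS = 8      # assumption: 8 primaries per index

# Hours covered by one index for each rollover period (30-day month assumed).
HOURS_PER_INDEX = {"hourly": 1, "daily": 24, "monthly": 24 * 30}


def shard_size_mb(mb_per_hour, hours, primary_shards):
    """Average size of one primary shard, in MB, for an index covering `hours`."""
    return mb_per_hour * hours / primary_shards


for period, hours in HOURS_PER_INDEX.items():
    size_mb = shard_size_mb(MB_PER_HOUR, hours, PRIMARY_SHARDS)
    print(f"{period:8s} {size_mb / 1000:7.3f} GB per primary shard")
```

Under these assumptions an hourly index yields shards of under 9 MB and a daily index about 210 MB, both far below the "few GB" lower bound, while a monthly index gives roughly 6.3 GB per shard, close to the rough figure in the post.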

Omar_Al_Zabir · November 18, 2016, 11:06am

Thanks. I have just started with 50 servers to test the infra, which is producing 70MB per hour. It will grow to over 2000 servers and should produce 3GB per hour. That's why I am going for an hourly index.

Is there a way we can add some tolerance to the delay of new index creation so that the error does not come up? Any metricbeat or logstash parameter I can use to wait for ES to be ready with the new index?

Christian_Dahlqvist · November 18, 2016, 11:25am

Even at 3GB per hour it is only around 72GB per day, and at that volume there is no reason to go to hourly indices. If you aim for an average shard size of 10GB and assume the size on disk is roughly the same as the raw input volume (this will depend on the level of enrichment and the mappings used, so it could be higher or lower), you still only need around 8 primary shards per day. Using hourly indices will, as you have seen, result in a lot of shards, which will cause you a lot more problems down the line.
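
The estimate above can be reproduced as a small calculation. This is a minimal sketch under the same assumptions stated in the post (on-disk size roughly equal to raw volume, 10GB target shard size); the helper name `primary_shards_needed` is hypothetical and not part of any Elasticsearch tooling.

```python
import math


def primary_shards_needed(gb_per_hour, hours_per_index, target_shard_gb):
    """Primary shards needed so each shard stays at or below the target size."""
    index_gb = gb_per_hour * hours_per_index
    # Every index needs at least one primary shard.
    return max(1, math.ceil(index_gb / target_shard_gb))


GB_PER_HOUR = 3        # projected volume quoted in the thread
TARGET_SHARD_GB = 10   # target average shard size from the post

daily = primary_shards_needed(GB_PER_HOUR, 24, TARGET_SHARD_GB)
hourly = primary_shards_needed(GB_PER_HOUR, 1, TARGET_SHARD_GB)

print(f"daily index:  {daily} primary shards per day")
print(f"hourly index: {hourly} primary shard per index, {hourly * 24} per day")
```

A daily index of 72GB needs 8 primaries at a 10GB target. An hourly index holds only 3GB, so a single primary already sits below the target, and with 24 indices per day the cluster still ends up with three times as many primaries as the daily layout.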

system · December 16, 2016, 11:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
