Receiving logstash error after installing watcher: SERVICE_UNAVAILABLE/1/state not recovered / initialized SERVICE_UNAVAILABLE/2/no master

Hi,

I installed Watcher on my ES cluster and started receiving this error from my logstash servers:

{:timestamp=>"2015-10-21T10:56:48.613000-0700", :message=>"Failed to flush outgoing items", :outgoing_count=>37, :exception=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:215)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:67)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:153)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-10-21T10:57:49.637000-0700", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}

There were no corresponding errors in the ES logs.

I changed the Logstash ES output protocol to 'http' to fix the problem.

Why would I receive this error using the 'node' protocol?

Thank you.

The ES client wasn't able to find a master node, most likely because it wasn't able to find the rest of the cluster. The cause of that situation can't be diagnosed without more information.

Each of the ES nodes, including the master, did show that it had joined.

What other information should I provide?

Thanks,
Darin

Does _cat/nodes list the client?

Yes, it does show up as a node in the cluster.

Before starting logstash:

[dfisher@ops-2 ~]$ curl -s -XGET 'http://mon-esm-1:9200/_cat/nodes'
mon-esd-3.dc1.fm-hosted.com 10.1.108.111 35 29 0.15 d - mon-esd-3
mon-esd-2.dc1.fm-hosted.com 10.1.108.103 35 29 0.11 d - mon-esd-2
mon-esm-1.dc1.fm-hosted.com 10.1.108.101 18 9 0.02 d * mon-esm-1
mon-esd-1.dc1.fm-hosted.com 10.1.108.102 24 8 0.10 d - mon-esd-1

After:
[dfisher@ops-2 ~]$ curl -s -XGET 'http://mon-esm-1:9200/_cat/nodes'
mon-esd-3.dc1.fm-hosted.com 10.1.108.111 35 29 0.06 d - mon-esd-3
mon-esd-2.dc1.fm-hosted.com 10.1.108.103 35 29 0.10 d - mon-esd-2
mon-esm-1.dc1.fm-hosted.com 10.1.108.101 18 9 0.01 d * mon-esm-1
mon-esd-1.dc1.fm-hosted.com 10.1.108.102 24 8 0.51 d - mon-esd-1
mon-esp-1.dc1.fm-hosted.com 10.1.108.104 27 c - logstash-mon-esp-1.dc1.fm-hosted.com-2515-11638
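
For anyone repeating this check, here is a minimal shell sketch that picks the client node(s) out of a listing like the one above. The function name `list_client_nodes` is made up for illustration, and it assumes node names contain no spaces. It reads fields from the right-hand side because the client row has fewer columns than the data rows.

```shell
# Hypothetical helper: print the names of client nodes (role "c")
# from `_cat/nodes` output supplied on stdin.
# Assumption: node names contain no whitespace, so the last three
# fields are always role, master marker, and name.
list_client_nodes() {
  awk 'NF >= 3 && $(NF-2) == "c" { print $NF }'
}

# Usage, with the master host taken from the listing above:
#   curl -s -XGET 'http://mon-esm-1:9200/_cat/nodes' | list_client_nodes
```

If the Logstash node has joined, its generated node name should be printed; no output would suggest the client never made it into the cluster.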

Then, just over 2 minutes after starting logstash, the log shows the following:
{:timestamp=>"2015-10-22T12:48:12.712000-0700", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}
{:timestamp=>"2015-10-22T12:48:12.715000-0700", :message=>"Failed to flush outgoing items", :outgoing_count=>2027, :exception=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:215)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:67)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:153)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-10-22T12:49:13.781000-0700", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}
{:timestamp=>"2015-10-22T12:49:13.781000-0700", :message=>"Failed to flush outgoing items", :outgoing_count=>2027, :exception=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:215)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:67)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:153)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-10-22T12:50:14.858000-0700", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}

Try switching to the http protocol in the LS output.
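
A rough sketch of what that could look like, assuming the Logstash 1.x elasticsearch output, where the `protocol` option accepts "node", "transport", or "http". The host value here is only an assumption borrowed from the curl commands earlier in the thread; point it at whichever ES node should receive the HTTP requests.

```
output {
  elasticsearch {
    # Assumption: mon-esm-1 is reachable over HTTP (default port 9200).
    host     => "mon-esm-1"
    # Use the REST API instead of joining the cluster as a node.
    protocol => "http"
  }
}
```

With "http", Logstash talks to ES over REST and does not join the cluster, so it no longer depends on cluster membership the way the node protocol does.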

Yes, the HTTP output does work and that is what I have had to do at this point.
I would prefer the node protocol and am still "curious" as to why this is not working anymore.

The only change I made was to install watcher.

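Ahh, then you need the license plugin jar on your LS node. With the node protocol, Logstash joins the cluster as a client node (it shows up in your _cat/nodes output), so it most likely needs the same plugin that Watcher brought into the cluster.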

This sort of thing is why LS 2.0 will default to HTTP; it's easier and as fast as node/transport.

That's good to know.

Thank you so much for your help!