"Failed to flush outgoing items" using shield

security

(Tim Dunphy) #1

Hey guys,

I can't seem to use Logstash with ES Shield. I set this up on my test instances, because I'll be expected to make this work on the company setup this coming week. So lessons learned here will apply there.

It seems that the ES nodes can all talk to one another and nodes all are showing up in the ES kopf plugin.

But since starting up Logstash I have this message repeating in the logs:

{:timestamp=>"2015-07-12T20:23:02.764000-0400", :message=>"Failed to flush outgoing items", :outgoing_count=>1, :exception=>org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];, :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:210)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:73)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:148)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

And I also see this error:

{:timestamp=>"2015-07-12T20:28:40.917000-0400", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}

Here's how I have the output configured in Logstash:

output {

  elasticsearch {
           protocol => 'http'
           host => 'localhost'
           user => 'ls_admin'
           password => 'secretsauce'
           index => 'logstash-%{+YYYY.MM.dd}'
    }
  stdout { codec => rubydebug }
}

I was able to create an index in LS by running it on the command line using this command:

logstash -e 'input { stdin { } } output { elasticsearch {protocol => http user => admin password => secretsauce  host => localhost } }'

But, like I said, using the output I have in the file, I get the error I mention above. And even if I take that command line I just showed you and stash it in a file I get the same error. What the blinkin' hell...?


(Mark Walkom) #2

That means ES has a problem, you'll have to dig as to why.


(Tim Dunphy) #3

I find that I can curl the health of the cluster and it looks good:

#curl --user admin:secretsauce   -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 1,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

But when I take a look in the logs I see a bunch of stuff going wrong:

[2015-07-12 20:31:04,838][WARN ][discovery.zen.publish    ] [JF_ES1] timed out waiting for all nodes to process published state [50] (timeout [30s], pending nodes: [[JF_ES3][64jySUXkQuiHwGL_MSOxag][es3][inet[/66.147.235.108:9300]]{master=false}])
[2015-07-12 20:31:04,856][WARN ][cluster.service          ] [JF_ES1] cluster state update task [delete-index [logstash-2015.07.13]] took 30s above the warn threshold of 30s
[2015-07-12 20:31:14,449][DEBUG][action.admin.cluster.node.stats] [JF_ES1] failed to execute on node [64jySUXkQuiHwGL_MSOxag]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [JF_ES3][inet[/66.147.235.108:9300]][cluster:monitor/nodes/stats[n]] request_id [6168] timed out after [15002ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-07-12 20:31:25,620][WARN ][shield.transport         ] [JF_ES1] Received response for a request that has timed out, sent [26172ms] ago, timed out [11170ms] ago, action [cluster:monitor/nodes/stats[n]], node [[JF_ES3][64jySUXkQuiHwGL_MSOxag][es3][inet[/66.147.235.108:9300]]{master=false}], id [6168]
[2015-07-12 20:31:27,329][WARN ][shield.transport         ] [JF_ES1] Received response for a request that has timed out, sent [53925ms] ago, timed out [23924ms] ago, action [internal:discovery/zen/fd/ping], node [[JF_ES3][64jySUXkQuiHwGL_MSOxag][es3][inet[/66.147.235.108:9300]]{master=false}], id [6123]
[2015-07-12 20:33:42,985][INFO ][cluster.service          ] [JF_ES1] removed {[JF_ES3][64jySUXkQuiHwGL_MSOxag][es3][inet[/66.147.235.108:9300]]{master=false},}, reason: zen-disco-node_failed([JF_ES3][64jySUXkQuiHwGL_MSOxag][es3][inet[/66.147.235.108:9300]]{master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-07-12 20:33:43,011][DEBUG][action.admin.cluster.node.stats] [JF_ES1] failed to execute on node [64jySUXkQuiHwGL_MSOxag]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][cluster:monitor/nodes/stats[n]] disconnected
[2015-07-12 20:33:43,012][DEBUG][action.admin.cluster.node.info] [JF_ES1] failed to execute on node [64jySUXkQuiHwGL_MSOxag]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][cluster:monitor/nodes/info[n]] disconnected
[2015-07-12 20:33:43,024][DEBUG][action.admin.cluster.node.info] [JF_ES1] failed to execute on node [64jySUXkQuiHwGL_MSOxag]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][cluster:monitor/nodes/info[n]] disconnected
[2015-07-12 20:33:43,029][DEBUG][action.admin.indices.stats] [JF_ES1] [.kibana][0], node[64jySUXkQuiHwGL_MSOxag], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@6ad9f4ba]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][indices:monitor/stats[s]] disconnected
[2015-07-12 20:33:43,030][DEBUG][action.admin.indices.stats] [JF_ES1] [.kibana][0], node[64jySUXkQuiHwGL_MSOxag], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@44710845]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][indices:monitor/stats[s]] disconnected
[2015-07-12 20:33:43,030][DEBUG][action.admin.indices.stats] [JF_ES1] [.kibana][0], node[64jySUXkQuiHwGL_MSOxag], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@68f3035a]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][indices:monitor/stats[s]] disconnected
[2015-07-12 20:33:43,031][DEBUG][action.admin.indices.stats] [JF_ES1] [.kibana][0], node[64jySUXkQuiHwGL_MSOxag], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@344779c9]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][indices:monitor/stats[s]] disconnected
[2015-07-12 20:33:43,033][DEBUG][action.admin.cluster.node.info] [JF_ES1] failed to execute on node [64jySUXkQuiHwGL_MSOxag]
org.elasticsearch.transport.NodeDisconnectedException: [JF_ES3][inet[/66.147.235.108:9300]][cluster:monitor/nodes/info[n]] disconnected

I see that error repeated 20 times in the ES logs. This all started happening since introducing shield.

All I thought there was to setting up shield was what I found in this getting started guide:

Shield Getting Started

But maybe there's some configuration in the yaml file that has to happen in order for this to be able to work correctly?


(Mark Walkom) #4

Do you have Shield installed on all nodes in the cluster?


(Tim Dunphy) #5

Yup. I just double verified on the other two nodes. I had to log into both using a browser and pointing it to port 9200 on each of them.


(Jay Modi) #6

What does your shield configuration look like? Are you using SSL?

It seems like something is really slowing down the ping requests. We've seen that in the past on some very small AWS instances when using SSL.


(Tim Dunphy) #7

I haven't done anything to configure shield other than to install it and setup the admin user. My nodes are kinda small. The first ES node has 6GB of ram and the other two have 2GB of ram. I'm not using SSL. Although I will say that using it was in the plan. But I had to make sure that LS is working with ES & shield first. And it isn't.

Maybe that's why the communications problems are happening with shield?

I have a feeling that if I remove shield from this equaition LS and ES are going to start playing nicely together again.

This may not be a problem for the work machines I'm going to work on this week. They all have 8GB of memory installed.


(Jay Modi) #8

Those machines sound fine, what kind of CPU?

I'd take logstash out of the equation for now and focus on getting elasticsearch running with Shield. A couple other questions, are you using unicast discovery in elasticsearch and have you disabled multicast?


(Tim Dunphy) #9

The first ES node (also logstash) has a 3GhZ cpu and the other two nodes have 2.4GhZ cpu's each.

The first ES node has 2 cpu's and the other two nodes have only one cpu per.

My elasticsearch config is very basic.

Node 1:

node.name: "JF_ES1"
node.master: true
node.data: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["localhost"]

Node 2:

    node.name: "JF_ES2"
    node.data: true
   node.master: false
   discovery.zen.ping.multicast.enabled: false
   discovery.zen.ping.unicast.hosts: ["10.10.10.25"] 

Node 3:

    node.name: "JF_ES3"
    node.master: false
    node.data: true
   discovery.zen.ping.multicast.enabled: false
   discovery.zen.ping.unicast.hosts: ["10.10.10.25"]

Thanks for your help!


(Jay Modi) #10

Thanks for the information, can you try shutting down all the nodes? Then Bring up Node 1 and verify that there are no errors in the logs for that node after starting and it will elect itself as the master. Then trying starting one of the other nodes and see if you get errors? If you get errors with one node, try shutting it down and bringing up the other node to see if it also produces errors. If you do have errors, can you post the log lines from the nodes that are communicating?


(Tim Dunphy) #11

Hey again,

Ok, so I took your advice. And I brought up just one node at a time. I actually took my own time doing this. I started the first node, waited a half hour. Started the second node, waited another half hour. Then started the 3rd node and went out for lunch.

And when I got back I found absolutely no errors in the logs. I put a new log file into place so that all I would have are log entries since the most recent startups. And this is literally all I have:

[2015-07-13 10:45:08,555][INFO ][node                     ] [JF_ES1] version[1.6.0], pid[26382], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-07-13 10:45:08,556][INFO ][node                     ] [JF_ES1] initializing ...
[2015-07-13 10:45:08,615][INFO ][plugins                  ] [JF_ES1] loaded [shield, license], sites [kopf, head]
[2015-07-13 10:45:08,662][INFO ][env                      ] [JF_ES1] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [35.7gb], net total_space [44.4gb], types [rootfs]
[2015-07-13 10:45:09,125][INFO ][transport                ] [JF_ES1] Using [org.elasticsearch.shield.transport.ShieldServerTransportService] as transport service, overridden by [shield]
[2015-07-13 10:45:09,125][INFO ][transport                ] [JF_ES1] Using [org.elasticsearch.shield.transport.netty.ShieldNettyTransport] as transport, overridden by [shield]
[2015-07-13 10:45:09,126][INFO ][http                     ] [JF_ES1] Using [org.elasticsearch.shield.transport.netty.ShieldNettyHttpServerTransport] as http transport, overridden by [shield]
[2015-07-13 10:45:12,357][INFO ][node                     ] [JF_ES1] initialized
[2015-07-13 10:45:12,357][INFO ][node                     ] [JF_ES1] starting ...
[2015-07-13 10:45:13,376][INFO ][shield.transport         ] [JF_ES1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.10.10.25:9300]}
[2015-07-13 10:45:13,386][INFO ][discovery                ] [JF_ES1] elasticsearch/TFacMDCdTnWHdbYvb-7KDA
[2015-07-13 10:45:16,432][INFO ][cluster.service          ] [JF_ES1] new_master [JF_ES1][TFacMDCdTnWHdbYvb-7KDA][logs][inet[/10.10.10.25:9300]]{master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-13 10:45:16,505][INFO ][shield.license           ] [JF_ES1] enabling license for [shield]
[2015-07-13 10:45:16,505][INFO ][license.plugin.core      ] [JF_ES1] license for [shield] - valid
[2015-07-13 10:45:16,514][ERROR][shield.license           ] [JF_ES1]
#
# Shield license will expire on [Wednesday, August 12, 2015]. Cluster health, cluster stats and indices stats operations are
# blocked on Shield license expiration. All data operations (read and write) continue to work. If you
# have a new license, please update it. Otherwise, please reach out to your support contact.
#
[2015-07-13 10:45:16,741][INFO ][gateway                  ] [JF_ES1] recovered [1] indices into cluster_state
[2015-07-13 10:45:16,760][INFO ][http                     ] [JF_ES1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.10.10.25:9200]}
[2015-07-13 10:45:16,760][INFO ][node                     ] [JF_ES1] started
[2015-07-13 11:27:29,116][INFO ][cluster.service          ] [JF_ES1] added {[JF_ES2][5NQxH72FTZeDNqfpQMqSPw][es2][inet[/10.10.10.26:9300]]{master=false},}, reason: zen-disco-receive(join from node[[JF_ES2][5NQxH72FTZeDNqfpQMqSPw][es2][inet[/10.10.10.26:9300]]{master=false}])
[2015-07-13 12:00:12,221][INFO ][cluster.service          ] [JF_ES1] added {[JF_ES3][mUzjOGHsRTqwll25prK5OQ][es3][inet[/10.10.10.27:9300]]{master=false},}, reason: zen-disco-receive(join from node[[JF_ES3][mUzjOGHsRTqwll25prK5OQ][es3][inet[/10.10.10.27:9300]]{master=false}])

And that was leaving Logstash out of the picture the entire time! And shield was definitely in place. Any thoughts on what I should try next?


(Jay Modi) #12

The logs look great. Cluster has formed and there are no issues. So the next step is to turn logstash back on and see what happens when you have indexing against the cluster.


(Tim Dunphy) #13

Cool! Yeah I though so too. But I am still not able to create an index in Logstash with ES and shield in place. I'm still getting the same error as before:

{:timestamp=>"2015-07-13T14:47:55.050000-0400", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}
{:timestamp=>"2015-07-13T14:47:55.052000-0400", :message=>"Failed to flush outgoing items", :outgoing_count=>1, :exception=>org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];, :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:210)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:73)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:148)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-07-13T14:48:57.169000-0400", :message=>"Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :level=>:error}

And my output still looks the same and should be working from what I understand. Except it isn't!

#cat 30-logstash.conf
output {
  elasticsearch {
           protocol => 'http'
           host => 'localhost'
           user => 'ls_admin'
           password => 'secretsauce'
           index => 'logstash-%{+YYYY.MM.dd}'
    }
  stdout { codec => rubydebug }
}

And there is absolutely no activity in the ES logs even after starting up logstash and verifying that it was running.

Any suggestions you may have would be appreciated!


(Jay Modi) #14

Interesting. How are you starting logstash? I downloaded a copy of the 1.5.2 tar and created a simple config file, logstash-simple.conf:

input { stdin { } }
output {
  elasticsearch {
    protocol => http
    host => localhost
    user => admin
    password => ...
    index => 'logstash-%{+YYYY.MM.dd}'
  }
  stdout { codec => rubydebug }
}

Then I ran:

bin/logstash -f logstash-simple.conf

Added some input and all seems well:

Logstash startup completed
foobar here
{
       "message" => "foobar here",
      "@version" => "1",
    "@timestamp" => "2015-07-14T11:14:27.763Z",
          "host" => "..."
}

(Tim Dunphy) #15

Hey Jaynode,

I actually found what the problem is. Thanks for sticking with this and sorry it took me so long to find this out! :frowning:

The intention was to have each part of my logstash config living in a different file inside of conf.d for clarity of setup. However in my first file that was intended to just hold the input also had an additional output section that was not configured to work with shield.

output {

  stdout {
        codec => rubydebug
    }

  elasticsearch {
    host => "localhost"
  }
}

And apparently Logstash was glomming onto that output because it was the first one it saw, and failing to communicate with Elasticsearch. Once I eliminated that section from the config I was able to get LS to start writing to ES with shield in place.

#curl --user admin http://localhost:9200/_cat/indices | head -5
Enter host password for user 'admin':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1674  100  1674    0     0   3841      0 --:--:-- --:--:-- --:--:--  3848
green open logstash-2015.07.03 5 1    3046 0   3.4mb   1.7mb
green open logstash-2015.07.04 5 1    1063 0   1.8mb 981.5kb
green open logstash-2015.07.12 5 1   98083 0 251.1mb 118.7mb
green open logstash-2015.06.23 5 1     922 0   3.5mb   1.7mb
green open logstash-2015.06.29 5 1    4327 0   6.9mb   3.5mb

Thanks again for all the help!!


(system) #16