ES Node Disconnects after Enabling Shield

Hi Team,
Not sure whether this topic belongs under Shield or Elasticsearch.

We have ELK up and running (1 master, 1 client, and 4 data nodes) and have now implemented Shield with Active Directory. This is also working fine as far as user authentication and authorization are concerned.

But I am continuously facing a problem: Kibana keeps timing out, the reason being that the Elasticsearch cluster turns RED. It then auto-recovers and becomes GREEN again. This keeps happening regularly, and obviously I can only query when the cluster is not RED.

The ES logs say that node discovery is timing out.

Please find below the Elasticsearch discovery settings; the rest of the settings are defaults:

--------------------------------- Discovery ----------------------------------

discovery.zen.ping.multicast.enabled: true
discovery.zen.ping.unicast.hosts: ["..1.55","..1.254","..11.99", "..11.98","..1.134","..1.50"]
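In case it's relevant, with a fixed unicast host list most setups turn multicast off entirely; a unicast-only variant of the above (assuming all nodes can reach each other on the transport port) would be:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["..1.55","..1.254","..11.99", "..11.98","..1.134","..1.50"]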

The ES master node log message is attached.

Another issue is "failed to execute bulk item (index) index", whereas my Logstash role has all access:

POST /_shield/role/logstash
{
  "cluster": ["all"],
  "indices": [
    {
      "names": [ "*" ],
      "privileges": ["all"]
    }
  ]
}
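For reference, the role can be read back with the same API to confirm what is stored; GET /_shield/authenticate, run as the Logstash user, should also show which roles that user resolves to:

GET /_shield/role/logstash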

Thanks & Regards

Please don't use screenshots for logs; they are really hard to parse. I'd recommend putting logs in a gist or using pastebin. It looks like you are hitting timeouts on node stats actions. Are your nodes overloaded? Do you have GC issues or anything like that?
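For a quick check, something along these lines should show heap pressure, GC activity, and load (exact column names can vary by version):

GET /_cat/nodes?v&h=name,heap.percent,ram.percent,load
GET /_nodes/stats/jvm,os
GET /_nodes/hot_threads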

I have a 5000-word limit restriction, so I quickly uploaded a screenshot instead.

The nodes may be overloaded due to data volume; what can be done about that? However, each node's CPU (< 20%) and memory (< 60%) utilization is low. Will adding more nodes help?

I don't think it's a GC issue, because without Shield everything works perfectly. I suspect the nodes are pinging each other to collect status, and this activity is getting delayed because of Shield, perhaps some authorization delay?

I validated the system_key and also enabled Shield auditing; it doesn't show any issues.
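For context, enabling the audit trail only needs this line in elasticsearch.yml (log file output is the default, as far as I know):

shield.audit.enabled: true

A sample of the audit output: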

[2016-10-25 13:03:08,791] [es-master-node] [transport] [access_granted] origin_type=[local_node], origin_address=[..1.55], principal=[__marvel_user], action=[cluster:monitor/nodes/stats]
[2016-10-25 13:03:08,791] [es-master-node] [transport] [access_granted] origin_type=[local_node], origin_address=[..1.55], principal=[__marvel_user], action=[cluster:monitor/nodes/stats[n]]
[2016-10-25 13:03:08,792] [es-master-node] [transport] [access_granted] origin_type=[local_node], origin_address=[..1.55], principal=[__marvel_user], action=[indices:data/write/bulk]
[2016-10-25 13:03:08,793] [es-master-node] [transport] [access_granted] origin_type=[local_node], origin_address=[..1.55], principal=[__marvel_user], action=[indices:data/write/bulk[s]], indices=[.marvel-es-data-1,.marvel-es-data-1,.marvel-es-data-1,.marvel-es-data-1,.marvel-es-data-1]
[2016-10-25 13:03:09,033] [es-master-node] [transport] [tampered_request] origin_type=[transport], origin_address=[..11.99], action=[internal:discovery/zen/unicast]
[2016-10-25 13:03:09,119] [es-master-node] [rest] [anonymous_access_denied] origin_address=[..1.71], uri=[/_bulk]
[2016-10-25 13:03:09,409] [es-master-node] [transport] [access_granted] origin_type=[transport], origin_address=[..1.254], principal=[kibana-admin], action=[cluster:monitor/health], indices=[.kibana]
[2016-10-25 13:03:09,411] [es-master-node] [transport] [access_granted] origin_type=[transport], origin_address=[..1.254], principal=[kibana-admin], action=[cluster:monitor/health], indices=[.kibana]
[2016-10-25 13:03:09,570] [es-master-node] [transport] [tampered_request] origin_type=[transport], origin_address=[..11.98], action=[internal:discovery/zen/unicast]
[2016-10-25 13:03:10,532] [es-master-node] [transport] [tampered_request] origin_type=[transport], origin_address=[..11.99], action=[internal:discovery/zen/unicast]
[2016-10-25 13:03:11,059] [es-master-node] [rest] [anonymous_access_denied] origin_address=[..1.71], uri=[/_nodes/http]
[2016-10-25 13:03:11,071] [es-master-node] [transport] [tampered_request] origin_type=[transport], origin_address=[..11.98], action=[internal:discovery/zen/unicast]
[2016-10-25 13:03:11,130] [es-master-node] [rest] [anonymous_access_denied] origin_address=[..1.71], uri=[/_bulk]

Thanks & Regards,

Is the system key the same on all of the nodes? It doesn't look like it is. The tampered_request messages are indicative of a bad signature / system key mismatch:

[2016-10-25 13:03:09,033] [es-master-node] [transport] [tampered_request] origin_type=[transport], origin_address=[..11.99], action=[internal:discovery/zen/unicast]
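If the keys do turn out to differ, regenerating and redistributing the key is quick (paths assume a default Shield 2.x install):

bin/shield/syskeygen
# copy config/shield/system_key to every node, then make sure the
# elasticsearch user can read it, e.g.
chmod 600 config/shield/system_key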

:slight_smile:

Yes, the system key is the same on all nodes, but I had forgotten to change the file permissions on two of the nodes. I am still facing the timeout issue in Kibana, but I believe it's due to ES performance rather than Shield, as there are no such errors any more. I will debug that further.

Thanks Jay...

Hi Jay,
Kibana is still timing out. Can you please suggest why "cluster:monitor/nodes/stats[n]" is timing out? After the Shield implementation, which role is Elasticsearch using for internal node stats?

I am using the Active Directory integration and created users/roles for Logstash and Kibana, but I have not done anything for ES itself yet; no users have been created with esusers.
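In case it matters, the AD group to role mapping lives in config/shield/role_mapping.yml; mine looks roughly like the following (the group DN is a placeholder, not the real one), with a similar entry for the Kibana role:

logstash:
  - "cn=logstash-writers,ou=groups,dc=example,dc=com"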

Please find below the log from the ES master node:
[2016-10-26 10:21:48,645][INFO ][cluster.metadata ] [es-master-node] [myproject-dummy-json-log-2016.10.26] creating index, cause [auto(bulk api)], templates [], shards [2]/[1], mappings [json-log]
[2016-10-26 10:21:48,789][INFO ][cluster.routing.allocation] [es-master-node] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[myproject-dummy-json-log-2016.10.26][0], [myproject-dummy-json-log-2016.10.26][1]] ...]).
[2016-10-26 10:21:48,953][INFO ][cluster.routing.allocation] [es-master-node] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[myproject-dummy-json-log-2016.10.26][1], [myproject-dummy-json-log-2016.10.26][0]] ...]).
[2016-10-26 10:21:49,102][INFO ][cluster.metadata ] [es-master-node] [myproject-dummy-json-log-2016.10.26] update_mapping [json-log]
[2016-10-26 10:21:49,295][INFO ][cluster.metadata ] [es-master-node] [filebeat-2016.10.26] create_mapping [json-log]
[2016-10-26 10:23:07,688][DEBUG][action.admin.cluster.node.stats] [es-master-node] failed to execute on node [hNXte3tURoa7houS_YjYRw]
ReceiveTimeoutTransportException[[es-data-node-4][..11.98:9300][cluster:monitor/nodes/stats[n]] request_id [628949] timed out after [15000ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2016-10-26 10:23:21,715][WARN ][shield.transport ] [es-master-node] Received response for a request that has timed out, sent [29027ms] ago, timed out [14027ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-node-4}{hNXte3tURoa7houS_YjYRw}{..11.98}{..11.98:9300}{master=false}], id [628949]

Thanks & Regards