Upgrading ES from 2.1.1 to 2.2.0 - missing authentication token?

Hi,

I decided to try upgrading the current cluster from ES 2.1.1 to ES 2.2.0.
It's a mirrored pair. The cluster is running within AWS, so I'm using the cloud-aws plugin for EC2 discovery.

I successfully upgraded the first node, and it has assumed master status, but I have encountered a strange communication/authentication issue when upgrading the second node.

I followed the guidelines here, but I still seem to be hitting a strange issue.

From the main cluster log on the 2nd node:

[2016-02-03 12:29:41,241][INFO ][discovery.ec2            ] [Sharon Ventura] failed to send join request to master [{Space Phantom}{NzN7b7ZHT8uPu6oXJAORMg}{10.60.164.147}{10.60.164.147:9300}], reason [RemoteTransportException[[Space Phantom][10.60.164.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[Sharon Ventura][10.60.163.74:9300][internal:discovery/zen/join/validate]]; nested: ElasticsearchSecurityException[missing authentication token for action [internal:discovery/zen/join/validate]]; ]
[2016-02-03 12:29:42,455][DEBUG][action.admin.cluster.health] [Sharon Ventura] no known master node, scheduling a retry
[2016-02-03 12:29:44,255][INFO ][discovery.ec2            ] [Sharon Ventura] failed to send join request to master [{Space Phantom}{NzN7b7ZHT8uPu6oXJAORMg}{10.60.164.147}{10.60.164.147:9300}], reason [RemoteTransportException[[Space Phantom][10.60.164.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[Sharon Ventura][10.60.163.74:9300][internal:discovery/zen/join/validate]]; nested: ElasticsearchSecurityException[missing authentication token for action [internal:discovery/zen/join/validate]]; ]
[2016-02-03 12:29:47,269][INFO ][discovery.ec2            ] [Sharon Ventura] failed to send join request to master [{Space Phantom}{NzN7b7ZHT8uPu6oXJAORMg}{10.60.164.147}{10.60.164.147:9300}], reason [RemoteTransportException[[Space Phantom][10.60.164.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[Sharon Ventura][10.60.163.74:9300][internal:discovery/zen/join/validate]]; nested: ElasticsearchSecurityException[missing authentication token for action [internal:discovery/zen/join/validate]]; ]
[2016-02-03 12:29:49,472][DEBUG][action.admin.cluster.state] [Sharon Ventura] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-02-03 12:29:49,473][INFO ][rest.suppressed          ] /_cluster/settings Params: {}
MasterNotDiscoveredException[null]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:205)
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:239)
        at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:794)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2016-02-03 12:29:50,283][INFO ][discovery.ec2            ] [Sharon Ventura] failed to send join request to master [{Space Phantom}{NzN7b7ZHT8uPu6oXJAORMg}{10.60.164.147}{10.60.164.147:9300}], reason [RemoteTransportException[[Space Phantom][10.60.164.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[Sharon Ventura][10.60.163.74:9300][internal:discovery/zen/join/validate]]; nested: ElasticsearchSecurityException[missing authentication token for action [internal:discovery/zen/join/validate]]; ]

My elasticsearch.yml file:

cluster.name: cluster01
http.cors.enabled: true
network.host: 0.0.0.0
discovery.type: ec2
discovery.ec2.tag.project_code_info: "cluster01"
cloud.aws.region: eu-central-1
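For what it's worth, one quick check when join/validate fails with a security exception mid-upgrade is whether every node carries matching ES and Shield versions; a Shield jar left at 2.1.x on a 2.2.0 node could plausibly produce exactly this error. A minimal sketch, assuming the default package install path used later in this thread, and guarded so it runs harmlessly on a machine without ES installed:

```shell
# Path assumption: the Debian/RPM package layout used elsewhere in this thread.
PLUGIN_BIN=/usr/share/elasticsearch/bin/plugin

if [ -x "$PLUGIN_BIN" ]; then
  # Lists the installed plugins (shield, license, cloud-aws) on this node;
  # run it on each node and compare the output.
  "$PLUGIN_BIN" list
else
  echo "plugin script not found at $PLUGIN_BIN"
fi

# Helper: true only when two nodes report identical plugin listings.
same_plugins() {
  [ "$1" = "$2" ]
}
```

Comparing the raw `plugin list` output from both nodes is crude, but it catches a half-upgraded plugin immediately.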

I can see in the logs that it has detected the 1st node ([Space Phantom][10.60.164.147:9300]) without any intervention, but it apparently cannot authenticate.

I suspect this may be related to the Shield plugin, which is also installed, but the permissions are set up exactly as they were before. Nothing else has changed.

I'm using a username and password in Shield; no SSL is configured.

Can anyone assist?

Any advice would be helpful.

Anyone?

Please be patient when waiting for an answer to your questions. This is a community forum and as such it may take some time before someone replies to your question. Not everyone on the forum is an expert in every area so you may need to wait for someone who knows about the area you are asking about to come online and have the time to look into your problem.

Please see https://www.elastic.co/community/codeofconduct for more details on our code of conduct (in particular the "be patient" section).

There are no SLAs on responses to questions posted on this forum, if you require help with an SLA on responses you should look into purchasing a subscription package that includes support with an SLA such as those offered by Elastic: https://www.elastic.co/subscriptions

I appreciate the nature of this forum, and I'm already aware of it.

I'm also aware that niche subjects come up regularly and may not be easily answered.

Furthermore, I am already aware that this particular thread may be one of those. This isn't my first troubleshooting rodeo, and I'm not demanding anything from anyone. The hack-y nature of Elastic's products has repeatedly reminded me of this, too.

I'm not asking for a solution, just some advice or pointers: "Have you tried this?", "Have you looked at that?", etc.

I believe that's a pretty normal expectation of a community forum, and it isn't aimed at anyone in particular.

The fact is that you posted your question on Feb 3, 1:37 PM and asked:

Can anyone assist?

Then you added a new comment on Feb 3, 1:49 PM (you recently removed that message and merged it into the first one) and asked:

Can anyone assist?

Then you asked on Feb 3, 4:33 PM:

Anyone?

It's perfectly fine to ping a question after one or two days without an answer. But here, it was less than 3 hours later... That's why I suggested being patient.

Coming back to your question: it does sound like a Shield problem.
I wonder if something changed in Shield 2.2 vs 2.1 (in terms of serialization, for example).

Does a full cluster restart fix your issue?

Thank you kindly for your verbose responses.

You have helped me immensely and created a friendly tone within this thread, so that others may feel more comfortable responding.

Have a nice day.

For anyone else who may run into this issue:

I managed to resolve this issue by, on all nodes, resetting all settings and configuration: removing the license and shield plugins, removing all users, and re-adding everything exactly as before. These configurations were identical to begin with, so this is odd.

First, stop Elasticsearch on all nodes.
Stop Kibana if it's running locally.

If you have any custom roles, check their configuration in /etc/elasticsearch/shield/roles.yml and refresh it from a single known-good copy if possible.
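If no known-good copy exists, the stock role definitions at least give you the shape to check against. This is only an illustrative entry in the Shield 2.x roles.yml format; the privilege values are placeholders, not this cluster's real configuration, so restore from your own backup wherever possible:

```yaml
# Illustrative Shield 2.x roles.yml entry; privileges here are placeholders.
admin:
  cluster: all
  indices:
    '*':
      privileges: all
```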

Remove plugins:
/usr/share/elasticsearch/bin/plugin remove elasticsearch/license/latest
/usr/share/elasticsearch/bin/plugin remove elasticsearch/shield/latest

Remove users:
/usr/share/elasticsearch/bin/shield/esusers userdel admin
/usr/share/elasticsearch/bin/shield/esusers userdel logstash

Re-add plugins:
/usr/share/elasticsearch/bin/plugin install elasticsearch/license/latest -b
/usr/share/elasticsearch/bin/plugin install elasticsearch/shield/latest -b

Re-add users:
/usr/share/elasticsearch/bin/shield/esusers useradd admin -p adminuserpw -r admin
/usr/share/elasticsearch/bin/shield/esusers useradd logstash -p logstashuserpw -r logstash
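After re-adding the users, it may be worth confirming the file realm actually holds them with the roles you expect; `esusers list` prints each user with its roles in Shield 2.x. A sketch using the same paths and usernames as above, guarded so it's a no-op on machines where Shield isn't installed:

```shell
# Path assumption: the same package layout used in the steps above.
ESUSERS=/usr/share/elasticsearch/bin/shield/esusers

# Print the file-realm users and their roles on this node, if present.
if [ -x "$ESUSERS" ]; then
  "$ESUSERS" list
fi

# Helper: true when a listing line maps the given user to the given role,
# e.g. has_role 'admin : admin' admin admin
has_role() {
  echo "$1" | grep -E -q "^$2[[:space:]]*:.*$3"
}
```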

If you have any custom roles, double-check the configuration in /etc/elasticsearch/shield/roles.yml to verify it has not been modified or overwritten.

Start Elasticsearch on the first node.
Start Kibana if it's running locally.

Check that the indices have come up correctly and verify master node status.

Repeat all of the above steps on every other node.

Start Elasticsearch on the remaining nodes, one at a time.
Verify healthy cluster replication before starting the next node.
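A small sketch of that verification step, polling `_cluster/health` with the Shield credentials from this thread (admin:adminuserpw is obviously a placeholder); it gives up after a few attempts rather than looping forever:

```shell
ES_URL='http://localhost:9200'
AUTH='admin:adminuserpw'   # placeholder credentials from this thread

# True when a cluster-health JSON body reports green status.
is_green() {
  echo "$1" | grep -q '"status" *: *"green"'
}

# Poll a few times, then give up rather than loop forever.
for attempt in 1 2 3; do
  health=$(curl -s -u "$AUTH" "$ES_URL/_cluster/health" || true)
  if is_green "$health"; then
    echo "cluster is green; safe to start the next node"
    break
  fi
  echo "attempt $attempt: cluster not green yet"
  sleep 2
done
```

Depending on your replica counts you may prefer `wait_for_status=yellow` mid-restart, since the cluster won't go green until all nodes are back.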

I hope someone finds this useful.
