License validation seems to take very long

Hi everyone,
we are currently running a 5.6 cluster with 5 nodes, 4 mdi (master/data/ingest) and 1 coordinating node, and a valid X-Pack license.
When we restart a node it takes forever (about 8 minutes) until the license is validated and the node is allowed to find the master. This raises another problem: authentication to realm ldap1 fails due to invalid credentials.

[2019-06-17T19:47:25,999][WARN ][o.e.x.s.a.AuthenticationService] [worker2] Authentication to realm ldap1 failed - authenticate failed (Caused by LDAPException(resultCode=49 (invalid credentials), errorMessage='invalid credentials'))
[2019-06-17T19:49:27,311][INFO ][o.e.l.LicenseService     ] [worker2] license [license ID] mode [platinum] - valid
[2019-06-17T19:49:27,344][WARN ][o.e.c.s.ClusterService   ] [worker2] cluster state update task [zen-disco-receive(from master [master {worker3}{ID}{ID}{IP}{IP}{ml.max_open_jobs=10, ml.enabled=true} committed version [7661]])] took [8.1m] above the warn threshold of 30s

Is there a reason for that?
If more information about our setup is needed, feel free to ask and I'll provide as much as I can.

Thank you very much,

Given that you have a platinum-level license, you should reach out to your Support team for assistance here :slight_smile:

Hi Mark,
I tried to get in touch with support, but apparently we have a weird (startup) license that comes with no support. I guess we get all the perks of the software but none of the Elasticsearch expertise! :slight_smile:

Anyway, if you have any idea about where I should start looking to find out why it takes so long to validate the license that would be great.

I tried on an AWS cluster with the non-production license, and there the license is always validated quite quickly. However, those machines are not very busy.

Thanks,

@TimV might have an idea?

This description seems to imply a causation that is the reverse of reality.
The license is stored in the cluster state, so a newly started node cannot activate the license until it connects to the master and receives an up-to-date copy of the license state.

That is, your cluster formation isn't being delayed due to license checks, it's that the license checks are delayed because cluster formation is slow.
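
As a quick sanity check (this assumes default host/port and credentials for an elastic superuser; adjust to your environment), you can ask a node which license it currently holds:

# Show the license held in the cluster state (5.x endpoint)
curl -u elastic:changeme 'http://localhost:9200/_xpack/license?pretty'

Until the node has joined the cluster you shouldn't expect this to report the platinum license.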

This could be caused by a variety of issues, but my guess is that this is because your native realm is unavailable when the node is disconnected from the cluster (because the security index is not available), so authentication requests that should be handled by the native realm are falling through to your LDAP realm, and failing.
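
For context, realms are consulted in their configured order, so a chain like the sketch below (realm names, order and LDAP details are illustrative, not taken from your config) would behave exactly that way: while the security index is unavailable the native realm can't answer, and the request ends up at the LDAP realm.

# Illustrative 5.x realm chain in elasticsearch.yml - names, order and URL are assumptions
xpack.security.authc.realms:
  native1:
    type: native
    order: 0            # tried first, but needs the .security index, i.e. a formed cluster
  ldap1:
    type: ldap
    order: 1            # tried next, so requests fall through here while the node is isolated
    url: "ldaps://ldap.example.com:636"
    # other required LDAP settings (bind_dn, user search, etc.) omitted for brevity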

This is the issue that we really need to solve, but it doesn't look like we have much information to go on.

What sort of network do you have between the nodes in your cluster?

Ping @DavidTurner in case he has some suggestions (but bear in mind that 5.6 is getting old).

Given that cluster updates seem slow, I would first look at how many indices and shards you have in the cluster. Having too many shards often leads to this type of problem. How many shards do you have in your cluster? What is the hardware specification of the nodes in the cluster?
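
If it's easier, those numbers can be read straight off the cluster (assuming default host/port and suitable credentials):

# Total number of shards across the cluster
curl -s -u elastic:changeme 'http://localhost:9200/_cat/shards' | wc -l

# Per-index overview, including health, status and shard counts
curl -u elastic:changeme 'http://localhost:9200/_cat/indices?v'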

5.6 is EOL - https://www.elastic.co/support/eol

Hey, yes, ≥ 8 minutes to apply a cluster state looks like the problem here. I would like you to set the following setting in elasticsearch.yml on the problematic node:

logger.org.elasticsearch.cluster.service: TRACE

Then restart the node, wait for it to join the cluster, and then provide all the logs that it emitted since it started up. If you need to redact any information then please make it obvious that it's been redacted. It'd be useful if you didn't redact the node IDs as you did in the OP. These IDs are randomly-generated and contain no information except their identity.
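
A simple way to tell when the restarted node has joined again (again assuming default host/port and credentials) is to watch the nodes list from another node in the cluster:

# The restarted node should reappear in this list once it has joined
curl -u elastic:changeme 'http://localhost:9200/_cat/nodes?v'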

I also echo @Christian_Dahlqvist's questions about the number of indices and shards in this cluster, and @warkolm's point that 5.6 is past the end of its supported life and you should be working towards upgrading to a supported version as a matter of some urgency.

We have 5 nodes on 4 different machines. Each Elasticsearch node is an mdi node except for the coordinating node. Each machine has 128 GB of RAM, with 32 GB of heap for the mdi nodes and 4 GB for the coordinating node. In general each machine has about 30 GB of free RAM. On the CPU side, all machines have 32 CPUs.
The network between the nodes is 10Gb (fiber?).

We have a total of 3220 indices and 5240 shards, but of those 3220, 1800 are closed and 500 are special.

We added the coordinating node as a temporary solution while we were enabling X-Pack security.

We have started planning how to migrate to 7.1.0, but it will take us some months to get there.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.