Elasticsearch will not start "error loading whitelist(s)"

My cluster ran low on space and ended up red due to unallocated shards. After cleaning out some data and doing my best to get the shards all allocated again, I tried restarting the node with the most disk space.

Now it will not start for me.

After trying several things, and seeing several different messages, I tried upgrading, and then just a full reinstall.

Along the way I saw this message a lot:

fatal error in thread [main], exiting
java.util.ServiceConfigurationError: Service lookup key set is inconsistent with original name set!

After trying a reinstall, that message no longer shows up.

Now my cluster log shows this when I try to start:

I am configuring this via the official Ansible Elasticsearch role, checked out at the 7.8.0 tag.

My host VM is running Ubuntu 18.04.

Data and logs are stored on an iSCSI mount.
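
For reference, the relevant path settings in my elasticsearch.yml look roughly like this (the mount paths below are examples of the layout, not copied verbatim):

path.data: /mnt/elasticsearch_iscsi/node3/data
path.logs: /mnt/elasticsearch_iscsi/node3/logs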

Of note, when running the Ansible role, I do see an error in the "elasticsearch : Remove key password" task. Then it tries to start ES and fails.

TASK [elasticsearch : Remove key password] **************************************************************************************************************************************************
failed: [esnode3.example.org] (item=http) => {"ansible_loop_var": "item", "changed": true, "cmd": "/usr/share/elasticsearch/bin/elasticsearch-keystore remove 'xpack.security.http.ssl.secure_key_passphrase'", "delta": "0:00:01.276038", "end": "2020-08-17 12:53:40.845536", "item": "http", "msg": "non-zero return code", "rc": 78, "start": "2020-08-17 12:53:39.569498", "stderr": "ERROR: Setting [xpack.security.http.ssl.secure_key_passphrase] does not exist in the keystore.", "stderr_lines": ["ERROR: Setting [xpack.security.http.ssl.secure_key_passphrase] does not exist in the keystore."], "stdout": "", "stdout_lines": []}
failed: [esnode3.example.org] (item=transport) => {"ansible_loop_var": "item", "changed": true, "cmd": "/usr/share/elasticsearch/bin/elasticsearch-keystore remove 'xpack.security.transport.ssl.secure_key_passphrase'", "delta": "0:00:01.346804", "end": "2020-08-17 12:53:43.021195", "item": "transport", "msg": "non-zero return code", "rc": 78, "start": "2020-08-17 12:53:41.674391", "stderr": "ERROR: Setting [xpack.security.transport.ssl.secure_key_passphrase] does not exist in the keystore.", "stderr_lines": ["ERROR: Setting [xpack.security.transport.ssl.secure_key_passphrase] does not exist in the keystore."], "stdout": "", "stdout_lines": []}
...ignoring

RUNNING HANDLER [elasticsearch : restart elasticsearch] *************************************************************************************************************************************
fatal: [esnode3.example.org]: FAILED! => {"changed": false, "msg": "Unable to start service elasticsearch: Job for elasticsearch.service failed because the control process exited with error code.\nSee \"systemctl status elasticsearch.service\" and \"journalctl -xe\" for details.\n"}
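
For what it's worth, those rc 78 failures just seem to be the keystore tool saying the passphrase entries aren't there (that's exactly what the stderr reports). Listing the keystore should confirm it; the entries will vary by setup:

/usr/share/elasticsearch/bin/elasticsearch-keystore list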

I have done a good bit of googling and haven't found much.

Any help would be appreciated.

Thanks!

Edit:

The message I thought went away after the reinstall actually didn't. It shows up in syslog via systemd, not in the ES cluster log.

From syslog:

Aug 17 13:21:57 esnode3 systemd[1]: Starting Elasticsearch...
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: fatal error in thread [main], exiting
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: java.util.ServiceConfigurationError: Service lookup key set is inconsistent with original name set!
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.apache.lucene.analysis.util.AnalysisSPILoader.reload(AnalysisSPILoader.java:133)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.apache.lucene.analysis.util.TokenFilterFactory.reloadTokenFilters(TokenFilterFactory.java:73)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.plugins.PluginsService.reloadLuceneSPI(PluginsService.java:577)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.plugins.PluginsService.loadBundle(PluginsService.java:551)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:473)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:164)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.node.Node.<init>(Node.java:317)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.node.Node.<init>(Node.java:266)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.cli.Command.main(Command.java:90)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Aug 17 13:22:06 esnode3 systemd-entrypoint[22042]: #011at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Aug 17 13:22:07 esnode3 systemd-entrypoint[22042]: 2020-08-17 20:22:07,361332 UTC [22246] ERROR CNamedPipeFactory.cc@226 Unable to open named pipe /tmp/elasticsearch-11145045001933063665/controller_log_22042 for writing: Interrupted system call
Aug 17 13:22:07 esnode3 systemd-entrypoint[22042]: 2020-08-17 20:22:07,362068 UTC [22246] ERROR CLogger.cc@298 Cannot log to named pipe /tmp/elasticsearch-11145045001933063665/controller_log_22042 as it could not be opened for writing
Aug 17 13:22:07 esnode3 systemd-entrypoint[22042]: 2020-08-17 20:22:07,362102 UTC [22246] FATAL Main.cc@102 Could not reconfigure logging
Aug 17 13:22:07 esnode3 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 13:22:07 esnode3 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Aug 17 13:22:07 esnode3 systemd[1]: Failed to start Elasticsearch.

Is there any more to your log after the bootstrap error?

Nope. Those are all the logs I see, from when I try to start, to when it fails.

Ok, it sounds like something has gone astray here then.
Did you change any settings during the install process?

I ran the Ansible role with the exact same settings I had used when it was working.

Unfortunately my VPN kicked me off in the middle of troubleshooting (not while I was actually doing anything), so I lost my terminal history from earlier in the day. Here's what I remember:

  • Cluster was in red status due to unassigned shards.
  • I cleaned out some space so the cluster would have enough room to allocate shards.
  • I reset the Retry Failed counter a couple of times.
  • A few indices were corrupted badly enough that I deleted them, based on what /_cluster/allocation/explain?pretty reported.
  • When /_cat/health showed no more pending tasks, but lots of shards were still unassigned, I tried to restart Elasticsearch on the node with the most disk space. (The exact API calls I mean here are sketched just after this list.)
  • That is when I first saw the systemd error message I added to my original post.
  • After some searching, I tried adjusting file permissions on my data and log directories.
  • I think I then tried running the Ansible role without trying to upgrade.
  • Then I tried upgrading to 7.8.1 via the Ansible role.
  • When that also failed to start, I tried removing ES from the node entirely, and reinstalling 7.8.0 via the Ansible role.
  • The only setting I ever changed was es_version in the Ansible role when I tried upgrading. I put it back to 7.8.0 after. Other than that, the settings were all exactly the same as when I had upgraded to 7.8.0 weeks ago.
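
To be concrete, these are the calls I'm referring to above (a sketch; adjust the host for your setup):

# cluster status, unassigned shard count, and pending tasks
curl -s 'localhost:9200/_cat/health?v'

# ask why a particular shard is unassigned
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# reset the "Retry Failed" counter, i.e. retry allocations that hit the max-retries limit
curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'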

Does that give any clues?

I'd really like to fix this so I can get 7.9 installed. Anyone have any ideas?

Is the node that has this error the one that ran out of space, the node that started to upgrade/was upgraded and then downgraded, the node that you made directory permission changes to, or none of those?

The node the error is on is the node with the most space. Let's call it node3.

Node3 is also the one I tried upgrading. After the upgrade, it failed to start, so I downgraded.

I did make directory permission changes on it as well. If I recall correctly, I set /mnt/elasticsearch/{data,logs} to 755 (not recursively). When I ran the Ansible role again, it reset the permissions to something else.
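
If it matters, it was roughly this, plus a quick check afterwards to see what the role set them back to (I didn't note the exact values):

chmod 755 /mnt/elasticsearch/data /mnt/elasticsearch/logs
ls -ld /mnt/elasticsearch/data /mnt/elasticsearch/logs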

That's quite possibly the issue, do you have logs from that time?

I'll see if I can dig them up.

But, well, I was running into these same errors before I tried upgrading. At least I think...

Hmm.... I should look up a tool to record troubleshooting issues with...

Ok, here's a gist of the logs from the time period where I tried to upgrade.

Note that the "Service lookup key set is inconsistent with original name set!" error is present when trying to run 7.8.0 before the upgrade attempt, while trying to get 7.8.1 started, and then again once we're back on 7.8.0.

I'd share the full, sanitized log file, but pasting that much text into a gist isn't working. I'll see if I can find another way.

@warkolm Can I private message you a link to a GSuite file with all the logs in it? If it would be helpful?

I've asked someone else to take a look here, it's outside my scope of knowledge.

Ok, so I was able to manually allocate the replica shards to the working nodes.
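
In case it helps anyone following along, the reroute command I used looks roughly like this (the index name, shard number, and node name below are placeholders):

curl -s -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "allocate_replica": { "index": "my-index-000001", "shard": 0, "node": "node1" } }
  ]
}'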

That still leaves 75 primary shards unavailable.

Of course, somewhere along the line, these shards ended up with no replicas. (I have limited disk space, so I've been fairly aggressive with my ILM policies, and removing replicas was one way to clean up when allocation failed due to low disk space...)
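
The replica removal I'm talking about is just the index settings call, something like this (placeholder index name):

curl -s -X PUT 'localhost:9200/my-index-000001/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'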

Some are old enough I don't care if I lose them. But if I'm ever going to bring ES into production, I really shouldn't have lost data like this....

Anyway, is there a way to manually move the data from one node to another that would let me get these shards working on one of my good nodes?

If I'm reading the allocation docs right, forcing the issue will result in data loss, since I'll be telling ES to activate primary shards on nodes that don't have the data for those shards.
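
Specifically, the command the docs describe (and which I'm hesitant to run) is something like this, with placeholder names; the accept_data_loss flag makes the consequence explicit:

curl -s -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "allocate_empty_primary": { "index": "my-index-000001", "shard": 0, "node": "node1", "accept_data_loss": true } }
  ]
}'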

Edit:

And now another wrinkle has appeared... Somehow ES was updated to 7.9 on both of the working nodes and on the non-working node.

I thought the Ansible role pinned the version so OS updates wouldn't upgrade it...
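
If anyone else wants to guard against that, holding the package at the OS level should work on Ubuntu (this is plain apt, separate from whatever the role does):

sudo apt-mark hold elasticsearch
# and to release it before an intentional upgrade:
sudo apt-mark unhold elasticsearch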

Anyway, I still get the same errors on node3 as before.

journalctl -l -u elasticsearch:


-- Logs begin at Thu 2019-12-19 06:57:42 PST, end at Tue 2020-08-25 12:52:12 PDT. --
Apr 28 14:07:26 node3 systemd[1]: Starting Elasticsearch...
Apr 28 14:07:27 node3 elasticsearch[1838]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future releas
Apr 28 14:07:51 node3 systemd[1]: Started Elasticsearch.
Apr 28 14:41:53 node3 systemd[1]: Stopping Elasticsearch...
Apr 28 14:41:54 node3 systemd[1]: Stopped Elasticsearch.
Apr 28 14:41:54 node3 systemd[1]: Starting Elasticsearch...
Apr 28 14:41:55 node3 elasticsearch[14176]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future relea
Apr 28 14:42:14 node3 elasticsearch[14176]: ERROR: [1] bootstrap checks failed
Apr 28 14:42:14 node3 elasticsearch[14176]: [1]: memory locking requested for elasticsearch process but memory is not locked
Apr 28 14:42:14 node3 elasticsearch[14176]: ERROR: Elasticsearch did not exit normally - check the logs at /mnt/elasticsearch_iscsi/node3/logs/clustername.log
Apr 28 14:42:15 node3 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=78/n/a
Apr 28 14:42:15 node3 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Apr 28 14:42:15 node3 systemd[1]: Failed to start Elasticsearch.
Apr 28 14:44:46 node3 systemd[1]: Starting Elasticsearch...
Apr 28 14:44:46 node3 elasticsearch[14470]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future relea
Apr 28 14:45:36 node3 systemd[1]: Started Elasticsearch.
Apr 28 15:22:35 node3 systemd[1]: Stopping Elasticsearch...
Apr 28 15:22:36 node3 systemd[1]: Stopped Elasticsearch.
Apr 28 15:22:36 node3 systemd[1]: Starting Elasticsearch...
Apr 28 15:22:37 node3 elasticsearch[16533]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future relea
Apr 28 15:22:59 node3 systemd[1]: Started Elasticsearch.
Apr 28 15:27:47 node3 systemd[1]: Stopping Elasticsearch...
Apr 28 15:27:48 node3 systemd[1]: Stopped Elasticsearch.
Apr 28 15:27:48 node3 systemd[1]: Starting Elasticsearch...
Apr 28 15:27:49 node3 elasticsearch[17943]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future relea
...skipping...
Aug 25 12:51:23 node3 systemd[1]: Starting Elasticsearch...
Aug 25 12:51:33 node3 systemd-entrypoint[22418]: fatal error in thread [main], exiting
Aug 25 12:51:33 node3 systemd-entrypoint[22418]: java.util.ServiceConfigurationError: Service lookup key set is inconsistent with original name set!
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.apache.lucene.analysis.util.AnalysisSPILoader.reload(AnalysisSPILoader.java:133)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.apache.lucene.analysis.util.TokenFilterFactory.reloadTokenFilters(TokenFilterFactory.java:73)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.plugins.PluginsService.reloadLuceneSPI(PluginsService.java:661)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.plugins.PluginsService.loadBundle(PluginsService.java:634)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:473)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:165)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.node.Node.<init>(Node.java:328)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.node.Node.<init>(Node.java:277)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
Aug 25 12:51:33 node3 systemd-entrypoint[22418]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Aug 25 12:51:34 node3 systemd-entrypoint[22418]: 2020-08-25 19:51:34,069861 UTC [22626] ERROR CLogger.cc@310 Cannot log to named pipe /tmp/elasticsearch-4931750423922732067/contr
Aug 25 12:51:34 node3 systemd-entrypoint[22418]: 2020-08-25 19:51:34,069963 UTC [22626] INFO  Main.cc@103 Parent process died - ML controller exiting
Aug 25 12:51:34 node3 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Aug 25 12:51:34 node3 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Aug 25 12:51:34 node3 systemd[1]: Failed to start Elasticsearch.

clustername.log:

[2020-08-25T12:51:32,567][INFO ][o.e.n.Node               ] [node3.example.org] version[7.9.0], pid[22418], build[default/deb/a479a2a7fce0389512d6a9361301708b92dff667/2020-08-11T21:36:48.204330Z], OS[Linux/4.15.0-112-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/14.0.1/14.0.1+7]
[2020-08-25T12:51:32,592][INFO ][o.e.n.Node               ] [node3.example.org] JVM home [/usr/share/elasticsearch/jdk]
[2020-08-25T12:51:32,593][INFO ][o.e.n.Node               ] [node3.example.org] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms6g, -Xmx6g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-4931750423922732067, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/mnt/elasticsearch_iscsi/testelk03/logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/mnt/elasticsearch_iscsi/testelk03/logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -XX:MaxDirectMemorySize=3221225472, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=deb, -Des.bundled_jdk=true]
[2020-08-25T12:51:33,038][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node3.example.org] fatal error in thread [main], exiting
java.util.ServiceConfigurationError: Service lookup key set is inconsistent with original name set!
        at org.apache.lucene.analysis.util.AnalysisSPILoader.reload(AnalysisSPILoader.java:133) ~[lucene-analyzers-common-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:47:16]
        at org.apache.lucene.analysis.util.TokenFilterFactory.reloadTokenFilters(TokenFilterFactory.java:73) ~[lucene-analyzers-common-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:47:16]
        at org.elasticsearch.plugins.PluginsService.reloadLuceneSPI(PluginsService.java:661) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.plugins.PluginsService.loadBundle(PluginsService.java:634) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:473) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:165) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.node.Node.<init>(Node.java:328) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.node.Node.<init>(Node.java:277) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) ~[elasticsearch-cli-7.9.0.jar:7.9.0]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.9.0.jar:7.9.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.9.0.jar:7.9.0]

Found where the "inconsistent" error message likely originates:

Without replicas, those shards are lost unfortunately.

This is not really a good thing, especially if the package upgrades, the process starts, and then you try to downgrade.

Even though the actual data is still on node3's mount?

If I could recover node3, I'd think it'd pick up the data.

Ah well, this is why we're testing and not in production.

I guess the thing to do is delete the dead indices, clean out node3's data, then make node3 start like it's a new node. Right?
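
Something like this is what I have in mind (a sketch; the index names and data path are placeholders, and the path has to match whatever path.data points at in elasticsearch.yml):

# delete the indices whose primaries are gone
curl -s -X DELETE 'localhost:9200/my-dead-index-000001'

# wipe node3's local shard data and start it fresh
sudo systemctl stop elasticsearch
sudo rm -rf /mnt/elasticsearch_iscsi/node3/data/nodes
sudo systemctl start elasticsearch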

Ok, this topic is officially over. I've killed all the indices that are missing data, and my third node is up and running.

My takeaway? Elasticsearch is fragile when you're running minimum specs, and you really need to set up snapshots to back up your data.
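
For anyone reading later, a basic shared-filesystem snapshot repository doesn't take much to set up (a sketch; the location has to be listed under path.repo in elasticsearch.yml on every node, and the names are placeholders):

# register a filesystem repository
curl -s -X PUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{ "type": "fs", "settings": { "location": "/mnt/es_backups" } }'

# take a snapshot of all indices
curl -s -X PUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'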

Please create your own topic for this.