Elasticsearch 6.8.6 breaks abruptly!

Hi,

I have been facing a strange error where the Elasticsearch instance breaks abruptly. The ELK stack works fine initially, and I am able to push data to an index, but the stack later breaks with the following error. The logs from the Elasticsearch server and Kibana are below. Looking for some guidance here.

Elasticsearch logs:

[2020-10-26T15:56:16,478][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [xx-xxxx-xxxx.nam.nsroot.net] [logstash-2020.10.26][1]: failed to list shard for shard_started on node [rbumnqK6SsCzxoKABaorZA]
org.elasticsearch.action.FailedNodeException: Failed node [rbumnqK6SsCzxoKABaorZA]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:236) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:151) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:210) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1114) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1226)
...................
Caused by: org.elasticsearch.transport.RemoteTransportException: [xx-xxxx-xxxx.nam.nsroot.net][10.332.22.123:9300][internal:gateway/local/started_shards[n]]
Caused by: org.elasticsearch.ElasticsearchException: failed to load started shards
at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:169) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperat
.................
... 22 more
Caused by: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2020-10-26T19:33:39Z, (lock=NativeFSLock(path=/data/elasticsearch/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2020-10-26T15:33:16Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191) ~[lucene-core-7.7.2.jar:7.7.2 d4c30fc2856154f2c1fefc589eb7cd070a415b94 - janhoy - 2019-05-28 23:30:25]

Kibana logs:

{"type":"log","@timestamp":"2020-10-26T19:59:44Z","tags":["error","task_manager"],"pid":32410,"message":"Failed to poll for work: [security_exception] failed to authenticate user [kibana], with { header={ WWW-Authenticate={ 0="Bearer realm=\"security\"" & 1="ApiKey" & 2="Basic realm=\"security\" charset=\"UTF-8\"" } } } :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":401,"response":"{\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"failed to authenticate user [kibana]\",\"header\":{\"WWW-Authenticate\":[\"Bearer realm=\\\"security\\\"\",\"ApiKey\",\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\"]}}],\"type\":\"security_exception\",\"reason\":\"failed to authenticate user [kibana]\",\"header\":{\"WWW-Authenticate\":[\"Bearer realm=\\\"security\\\"\",\"ApiKey\",\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\"]}},\"status\":401}","wwwAuthenticateDirective":"Bearer realm=\"security\", ApiKey, Basic realm=\"security\" charset=\"UTF-8\""}"}

I have tried reinstalling the ELK stack several times, but the setup broke abruptly each time with the same error.

Thanks in advance!

Something (not Elasticsearch) is modifying the last-modified time of this file. Elasticsearch treats this as an indication that it does not have exclusive control over its data path, which can lead to data corruption, and therefore it stops working to protect your data.

The fix is to track down whatever else is altering things in the data path and prevent it from doing so. It's vitally important that Elasticsearch alone is permitted to alter the contents of its data path.
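If it helps, here is a rough way to confirm that something external is touching the lock file. This is only a sketch: the path is copied from the log above and the polling interval is arbitrary, so adjust both for your setup.

```python
#!/usr/bin/env python3
# Minimal sketch: poll the node.lock file's metadata and report whenever its
# last-modified time changes. Elasticsearch does not touch this timestamp in
# normal operation, so a change points at some other process.
import os
import time
from datetime import datetime, timezone

LOCK_FILE = "/data/elasticsearch/nodes/0/node.lock"  # assumption: same layout as in the log above
INTERVAL_SECONDS = 5                                  # arbitrary polling interval

last_mtime = os.stat(LOCK_FILE).st_mtime
while True:
    time.sleep(INTERVAL_SECONDS)
    mtime = os.stat(LOCK_FILE).st_mtime
    if mtime != last_mtime:
        when = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
        print(f"node.lock modified externally at {when}")
        last_mtime = mtime
```

On Linux, if auditd is available, watching the data path for writes and attribute changes (e.g. `auditctl -w /data/elasticsearch -p wa`) can go a step further and tell you which process is responsible.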

@DavidTurner, thanks for the input. I was able to resolve my problem using your hint.
The problem with my setup was that I had two nodes in my cluster with the same /data directory mounted on both of them. As a result, whenever I installed the ES server on one node, it was breaking or corrupting the files of the ES instance running on the other node.
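For anyone who hits the same thing, this is a rough sketch of the check I would now run on each node to make sure its data path sits on local storage rather than a shared mount. DATA_PATH is just a placeholder for whatever path.data points at in your elasticsearch.yml, and the list of network filesystem types is not exhaustive.

```python
#!/usr/bin/env python3
# Minimal sketch: find the mount that holds the Elasticsearch data path and
# warn if it is a network/shared filesystem.
import os

DATA_PATH = "/data/elasticsearch"  # assumption: path.data on this node
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "glusterfs", "fuse.sshfs"}

def mount_for(path):
    """Return (mount_point, fstype) of the longest mount prefix covering path."""
    best = ("", "unknown")
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _, mount_point, fstype = line.split()[:3]
            if path.startswith(mount_point) and len(mount_point) > len(best[0]):
                best = (mount_point, fstype)
    return best

mount_point, fstype = mount_for(os.path.realpath(DATA_PATH))
print(f"{DATA_PATH} is on {mount_point} ({fstype})")
if fstype in NETWORK_FS:
    print("WARNING: data path is on shared storage; give each node its own local path.data")
```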

Thanks

Thanks, that roughly makes sense, but it does mean that (a) you're using some kind of network-based shared storage and (b) this storage does not implement file locking correctly. Local storage is generally recommended over shared storage: it performs better and tends not to have this kind of correctness issue. Bug-free file locking isn't terribly important to Elasticsearch itself (except to protect against this kind of setup issue), but a filesystem that gets it wrong would make me worry that it doesn't implement other, more important, filesystem features correctly either. Tread carefully.
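If you want to see the locking problem directly, a quick probe is to grab an exclusive lock on a file on the shared mount from both nodes at the same time; if both nodes report success, the filesystem is not enforcing locks. This is only a sketch: LOCK_PATH is a placeholder, and flock() semantics over NFS depend on the client and server versions, so treat the result as indicative rather than conclusive.

```python
#!/usr/bin/env python3
# Minimal sketch: run this simultaneously on both nodes against the same file
# on the shared mount. If both copies print "lock acquired", exclusive locks
# are not being enforced across nodes.
import fcntl
import time

LOCK_PATH = "/data/locktest"  # assumption: any path on the shared mount

with open(LOCK_PATH, "w") as handle:
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock acquired; holding it for 60 seconds...")
        time.sleep(60)
    except BlockingIOError:
        print("lock is held elsewhere (expected on the second node)")
```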
