After a power failure, something in my Elasticsearch database has become corrupted and the elasticsearch service refuses to work. The error in the error log is:
[2017-05-26T14:28:33,318][WARN ][o.e.c.a.s.ShardStateAction] [vBU9z3G] [logstash-2017.04.09][0] unexpected failure while sending request [internal:cluster/shard/failure] to [{vBU9z3G}{vBU9z3GiTnOMVrOfRaHi8w}{hu8A1SqwQBuYLrkkLn4dTA}{127.0.0.1}{127.0.0.1:9300}] for shard entry [shard id [[logstash-2017.04.09][0]], allocation id [HIfOvHOIRSmh2int2wZw3g], primary term [0], message [shard failure, reason [failed to recover from translog]], failure [EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [122748612] length: [4] end: [122748612]]; ]]
org.elasticsearch.transport.SendRequestTransportException: [vBU9z3G][127.0.0.1:9300][internal:cluster/shard/failure]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:579) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:502) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:477) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.action.shard.ShardStateAction.sendShardAction(ShardStateAction.java:104) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.action.shard.ShardStateAction.shardFailed(ShardStateAction.java:169) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.action.shard.ShardStateAction.localShardFailed(ShardStateAction.java:163) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.sendFailShard(IndicesClusterStateService.java:681) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.failAndRemoveShard(IndicesClusterStateService.java:671) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.access$1100(IndicesClusterStateService.java:91) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService$FailedShardHandler.lambda$handle$0(IndicesClusterStateService.java:700) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:563) ~[elasticsearch-5.4.0.jar:5.4.0]
... 13 more
I guess something has become corrupted somewhere. I don't really care about the data and if any of it is lost - but I want to get the full ELK stack back up and running ASAP. Any ideas what to do, which file to delete, etc?
I'm running the latest version (5.4.0) on a Linux Mint machine. Processing data only from a single source on the same machine.
The whole of it? Wouldn't that, like, remove the whole database? It's not a huge problem - I have a backup of the logs and can re-create the database by feeding them to Logstash via netcat, but it's five months' worth of data and would probably take days or even weeks, so I'd rather avoid it if I can.
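(For reference, the re-feed I have in mind is just replaying each backed-up log file into Logstash's tcp input, roughly like this - the port depends on how the tcp input is configured, and 5000 here is only an assumption:)

# Replay a backed-up log file into a Logstash tcp input
# (localhost:5000 is an assumed tcp input configuration)
nc localhost 5000 < /path/to/backup.log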
Can't I just remove the corrupted translog and lose only the stuff not committed to the database? At least this is what the various articles I could find by googling this problem suggest doing, but none of them explains how exactly to find the thing I need to delete. They either refer to paths that don't exist and were probably relevant in earlier versions of Elasticsearch, or use some hardcoded random-looking sequences of letters that would, obviously, be different in each case.
No need to stop the service, BTW, since it isn't running. It can't start (or, more exactly, it crashes soon after starting) due to this error.
Yes, the articles I've found so far suggest that this is exactly the way to proceed; it causes minimum data loss. This is what I want to do, too. The problem is, I have no idea how to find the corrupted index. Can it be determined from the error message and how exactly?
There are no subdirectories named "vBU9z3GiTnOMVrOfRaHi8w", "hu8A1SqwQBuYLrkkLn4dTA", or "HIfOvHOIRSmh2int2wZw3g", either.
Can I get a more competent answer, please? I really need to get the service up and running ASAP.
I've found this article, which talks about truncating a faulty translog, but it uses the magic value "P45vf_YQRhqjfwLMUvSqDw" with absolutely no explanation of where it is derived from!
Starting with 5.x, Elasticsearch no longer uses the index name and cluster name as directory names to store indices on disk; that's why you are seeing these hash-like directory names in /var/lib/elasticsearch.
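(For example, assuming the default data path of a package install - an assumption on my part - a listing shows UUID-named directories rather than index names:)

# Entries here are index UUIDs, not index names
# (/var/lib/elasticsearch is the default data path for package installs)
ls /var/lib/elasticsearch/nodes/0/indices/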
The exception you have posted does not indicate any corruption to me; what made you conclude this? Have you tried simply restarting Elasticsearch?
Yes, I have tried restarting ES. After a while, it prints this error in the error log and terminates. Googling around suggested that this error occurs when the translog is corrupted.
The full log is 170 KB; I cannot paste it in a message here, since there is a restriction on how large a message can be. I've put it on Pastebin instead.
I am not worried about lost data; I have the original logs that the data in the database came from, and will feed them to Logstash manually if necessary. The only thing I'll lose is time. But if the database gets corrupted every time there is a power failure and my only recourse is to delete it and re-create it from scratch, that's simply not acceptable.
I saw your first post on my mobile, so I guess I missed part of the exception message that you initially posted. I have now seen the full exception message, and I do agree that the translog is corrupted. I am sorry.
It is very unlikely that the translog gets corrupted due to power failure. Are you using RAID for your disks?
Lastly, a failed recovery should not block a node from starting up. The node should be able to start up with the failed indices unavailable. The last lines of the log seem to indicate that the node was stopped externally. Did you stop the service?
No. It's a VirtualBox VM running Linux Mint on a Windows 10 host.
In this case - yes, because the log kept getting filled with this message. Also, Kibana says that the Elasticsearch plugin is red, and Logstash's log keeps getting filled with errors saying it couldn't connect to the Elasticsearch instance. For all practical purposes, ES isn't working.
So, you are saying that my only recourse is to delete the whole /var/lib/elasticsearch/nodes/ tree, re-create the database from scratch with the original data and hope that it doesn't happen again?
I don't know if there is a way to recover without deleting. Maybe @jasontedor has more to say.
Also, you really shouldn't be running a production cluster in such an environment, since we don't test with that configuration. I would say that the root cause of your corruption lies in your environment and not in Elasticsearch.
Also, Windows 10 is not a supported OS, either as host or as guest. See our Support Matrix for what is supported.
I highly recommend moving away from this environment, as you may hit the same problem again and it won't be Elasticsearch's fault.
It's just a research project, nothing mission-critical.
The Win10 host is for the Linux Mint virtual machine; not for ELK. ELK is running in Linux on that VM.
Linux Mint is basically Ubuntu 16.04, which seems to be supported, according to your matrix. Or are you saying that Elasticsearch is not able to work reliably on a VM at all? Aren't most of the cloud instances out there just different VMs? In any case, I can't afford to dedicate a separate physical Linux machine to this project (and one with lots of RAM, too, because ES really doesn't like environments with less than 4 GB of RAM).
I guess you didn't bother reading the whole thread. I've already read that article and it is useless, because it uses some "magic" string (P45vf_YQRhqjfwLMUvSqDw) and totally fails to explain how to derive it from the error message or anywhere else.
This is not a good way to respond to someone who was only trying to help you, especially on a community forum for an open-source project, but frankly anywhere. You establish yourself as an adversary, and you reduce the likelihood that someone else will wade in and try to collaborate with you on finding a solution.
Here's a suggestion for a better response for you in the future: "okay, how do I find the path to apply that to?"
I see no indication in this thread that you read that article. Your assertion that it is useless is a stretch: if you knew what path to apply the tool to, then you could use the documentation on that page to get out of this mess. Therefore, I agree with you that the only challenge here is finding that path.
Let me try to help you with that. You have the index name. You can hit /_cat/indices?v. This will give you a response like:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open logstash-2017.04.09 qmDPqIHJTlmc-b9CB-gHSQ 5 1 0 0 985b 985b
That uuid is what you're looking for to get the path on disk. So in this case it would be nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/ relative to your data path.
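(Concretely, assuming the node is reachable on localhost:9200 - an assumption, since yours is currently crashing - the request would be:)

# List all indices with their UUIDs; host and port are assumptions
curl -s 'http://localhost:9200/_cat/indices?v'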
Next, from shard id [[logstash-2017.04.09][0]] we know that this is shard 0. Thus, the full path would be nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog, so I would run:
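(A sketch of that command, assuming the 5.x elasticsearch-translog tool from a default package install and /var/lib/elasticsearch as the data path - both paths are assumptions on my part:)

# With the node stopped, truncate the corrupted translog for shard 0 of that index
# (install and data paths assume a default package install)
/usr/share/elasticsearch/bin/elasticsearch-translog truncate \
    -d /var/lib/elasticsearch/nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog/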