Corrupted translog


#1

After a power failure, something in my Elasticsearch database has become corrupted and the Elasticsearch service refuses to work. The error in the error log is:

[2017-05-26T14:28:33,318][WARN ][o.e.c.a.s.ShardStateAction] [vBU9z3G] [logstash-2017.04.09][0] unexpected failure while sending request [internal:cluster/shard/failure] to [{vBU9z3G}{vBU9z3GiTnOMVrOfRaHi8w}{hu8A1SqwQBuYLrkkLn4dTA}{127.0.0.1}{127.0.0.1:9300}] for shard entry [shard id [[logstash-2017.04.09][0]], allocation id [HIfOvHOIRSmh2int2wZw3g], primary term [0], message [shard failure, reason [failed to recover from translog]], failure [EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [122748612] length: [4] end: [122748612]]; ]]
org.elasticsearch.transport.SendRequestTransportException: [vBU9z3G][127.0.0.1:9300][internal:cluster/shard/failure]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:579) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:502) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:477) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.action.shard.ShardStateAction.sendShardAction(ShardStateAction.java:104) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.action.shard.ShardStateAction.shardFailed(ShardStateAction.java:169) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.action.shard.ShardStateAction.localShardFailed(ShardStateAction.java:163) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.sendFailShard(IndicesClusterStateService.java:681) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.failAndRemoveShard(IndicesClusterStateService.java:671) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.access$1100(IndicesClusterStateService.java:91) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService$FailedShardHandler.lambda$handle$0(IndicesClusterStateService.java:700) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.4.0.jar:5.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:563) ~[elasticsearch-5.4.0.jar:5.4.0]
	... 13 more

I guess something has become corrupted somewhere. I don't really care about the data or if any of it is lost, but I want to get the full ELK stack back up and running ASAP. Any ideas what to do, which file to delete, etc.?

I'm running the latest version (5.4.0) on a Linux Mint machine, processing data from only a single source on the same machine.


(Xavier Facq) #2

Hi,

Stop the service and remove the subdirectory in /var/lib/elasticsearch

Restart your service

bye,
Xavier


#3

The whole of it? Wouldn't that, like, remove the whole database? It's not a huge problem - I have a backup of the logs and can re-create the database by feeding them to Logstash via netcat, but it's five months' worth of data and would probably take days or even weeks, so I'd rather avoid it if I can.

Can't I just remove the corrupted translog and lose only the stuff not committed to the database? At least this is what the various articles I could find by googling this problem suggest doing, but none of them explains how exactly to find the thing I need to delete. They either refer to paths that don't exist and were probably relevant in earlier versions of Elasticsearch, or use some hardcoded random-looking sequences of letters that would, obviously, be different in each case.

No need to stop the service, BTW, since it isn't running. It can't start (or, more exactly, crashes soon after starting), due to this error.


(Xavier Facq) #4

I never tried it, but maybe it's possible to go into the subdirectory and remove only the corrupted index instead of removing everything.


#5

Yes, the articles I've found so far suggest that this is exactly the way to proceed; it causes minimal data loss. This is what I want to do, too. The problem is, I have no idea how to find the corrupted index. Can it be determined from the error message, and if so, how exactly?


(Xavier Facq) #6

I would say something like:

/var/lib/elasticsearch/clustername/nodes/0/indices/logstash-2017.04.09

No guarantee!


#7

There is no logstash-2017.04.09 directory there! Here are the contents of the /var/lib/elasticsearch/nodes/0/indices/ directory:

vess-box indices #  pwd
/var/lib/elasticsearch/nodes/0/indices
vess-box indices # ls
0fTgRJ9YRu2du0eF8rqa8g  dixT4ZPiQNygK2gpOITBjQ  jzNj40TsT6-vZEhfvO6mCA  OEcSPxD_QweC_Sk8lLsBJw  ui36JbrbSeSrC3Kjz0C0kw
_2A6QOP0RaWwU6v3BydXbQ  DpkXYdYwT7O4GxrsyZjWaA  kBR5spMiSE2IiSLSrh956A  oYt58aEXRNuy7vYmEnM5zA  uKDcM58NTCe0BD4xs0Qfjw
393xvIlkQ7aVrz-O-tB8_g  eBdHQUqrTEa3smkO4hgfkg  k_m461uaRLqt90fFbvBuIw  _pmXZivJRgWEz_MoMeZlzA  UrTQxr9XSyGtk5myHm6d1w
-3FpWccKSRudiJGGpvMb3Q  enPKG08nRmGjy_CfmpIn5w  kRMDpAIzTSO5GlHbcFRdEQ  PQIJ3mBaTMKcrh7aEZK12g  UVVSuLUaR7-zK3MqdSCsWg
3HsV2725TfWDdQu13lIUeg  ewv1HZ95SUmnqYpKDH186A  L3gTd5BOTB26OcHuzrcqaQ  pQSnV3StTbe1QZvzre7CmQ  V0S6htstR_6TNJu_KdBbhQ
_3pAXKVyRQmbn-wJnvVwNQ  FPCf6WZfQ3Sb0zo7UP1CdA  lclPkl8_T7SKEfAwFkzo5A  Q1q90W-SQI-2WzRNlZCBfw  V4xeq1O8Tg67Mp7rm7JmoA
3Uvc7NfKQsOUlT5fpV1hsw  fXSuAct6TIKKem3gtuoTig  lf3uQZhNSiSt803hAuh9Ig  Q7jrq2TTRF6LSUX1R0YzDg  voPosLaATeGOojpX5kA9BQ
3Vkj39NnTUaaLrEOlsGK7g  GE0QJuirQlm5tApJevniGA  L-mzI8yzRXS9AP5EtM_bBg  qAi7Oxp7SP2-SzXvLJ6y8w  VuxW9oaFQguBhW7ZMp3Srw
3VlPnFBWTxa0N-kUUzQ1nw  GiF1Y_72RquHMavzLuxpyQ  LOvi_cvDRWatzAmH5DrqUQ  _QLAZSh1TlymX5cRp9-wrA  w8wx27txTJi1_0M-NkPCOw
6oPmtSibQ6uZ7K-pP7B18A  giVKB1FtQHKcVJ_6B3v8xg  lpD5ci2ZSLaPDWI8XyKSHA  QXJlYy8fQV2s6O7fFdfW3w  wCSiNKaoSVSjMvH4FAYV0g
6qbsEou5Q5G_87_SS5oMCg  Hbj6Yf6nRI23nUL5p2LtXQ  LTepI5irT8aEY5NQXrT9qA  RjiZxZezT5a2FPy9VdsIAA  wDcGf5QcRO6HDovk97q-0g
6zaFhKtbR3m9VUyb_O8PYA  HchwWq_aTGOgl_ld-JDhTA  MPmCME7KQ3mZAF68t8k2Gw  Rnd5lomDTXiJR257VW3_mw  wHBpedk1Q4ecDO1885LG2A
7MbliO2RT_SuodyvCBubUw  HhRzCSAHSfyBlivSZSyOIQ  Ms6iqStOTtykzMo76M3CmA  RpkjWv6iQKKv_jXFlrc45A  wkyRuwhBR4KRa8Qd0D9nAg
7oPpuTfnSNahOxR0dKi8Tw  HhyBCPEXRtm4kZNwkd5ykA  MtTH5WeWQFGKhoSMmqyy4g  RRRK08S_QpG77K0Fy_fpBw  wn029MpySUqAmaklF_y9gg
7q4_ACmhTpiifrdbdOmEnA  HNtvhpF7Q3ODkpiiulki5Q  M-_XUp6XQGiYaD4nlYU3-Q  Se0Ck6OgQHqq20a6XJLS6A  WPl1nce0TFimw8fXHfyJAA
_8r7owK2RI6sBr7ratqWUw  hRPUvacQQ2KpuL4HOoQe1A  MyylMv4cQBmFgIFiPVCr_Q  S-H3XYpNQ8yccJQsfaquXA  x01pt8MfQt-lULHX66QTYg
97KLuff8RT-zA8dhs2X2JA  i0XjiruaTgCICiAyV2UArA  N92MLDdsS0KxjkeoEsOXsw  SPdmOnYaS5a0PJAWzWSgMA  Xe5aeZlxRye_DhOc3TapZw
aqbmj3jpS6aWS3NGRHHAng  IFWBRd5jQ0mOJzRDL_g3jQ  nPoWdaWoRBGnWg3ouVteHg  SPPfZz71RYGLLm2wIR2Mpg  XjX_pMerT3uZdUAE01RU7g
aUhu3oviRCOCucnRY5KLrQ  IkQMNiuNTBSTob9YnCQ0Yw  NRQ3KXHXQeeo8PMVCc8psA  TCb2IqEYTYCaIi-o0htU-Q  xYlPXd58QeSyJQU87IjN-A
c6oEvv7RQwmko6BnV7MjqQ  ikRAlaqXSgi1QNatEG4HVA  nUF_9E8GRKSAsbTLc-jtTA  TOxZvRGYQR2IPponfm6aZg  ycchXzUASrqJvAE6IQa_8w
cFzq7qdoR8qv0a2RGRfhqA  jkKboVFmQYujIjBfXuGSGQ  nzIGKXKDT0mwC7-h-KHVqA  tWelC4bbQBmKzmwmlnpVUA  yJi5paQoR2iVlC8MHywVYA
cICuC92ITES36zb6dvrq-g  jktFy9S4QlO8AHRoRt8bEA  O2OxBhBvSqiMDnNuUFdfYw  TXPkIiWESAWNvo0DdZZEBw  yQEcu9w4Qf228kDuS8ZTyA
cR132PY2SliXQkI7ovvnwQ  JrkfumFfTAmUK2QZcZXPPg  oa8fgTj3SaiAtVQw-eDoLQ  TYjzmV84Sw2118plDzDNaw  ZaiLF1ARQ--p9w7X4Kg1vQ
cRbKJzzQRxu1j_PneyBp-g  ju5LARlvTcu7ldaCzGOlQw  OcPIWdlrQRCV6Ngug0ySkA  u_7nOm6_R1ClbyhmhkmVDQ  Zr5izwCaR8eWKEC4bXaepQ

There are no subdirectories named "vBU9z3GiTnOMVrOfRaHi8w", "hu8A1SqwQBuYLrkkLn4dTA", or "HIfOvHOIRSmh2int2wZw3g", either.

Can I get a more competent answer, please? I really need to get the service up and running ASAP.

I've found this article, which talks about truncating a faulty translog, but it uses the magic value "P45vf_YQRhqjfwLMUvSqDw" with absolutely no explanation of where it comes from!


(Xavier Facq) #8

Be careful with the location; note the cluster name: /var/lib/elasticsearch/clustername/

ll /var/lib/elasticsearch/clustername/nodes/0/indices/

You don't have index names?

@dadoonet @warkolm any idea ?


(Thiago Souza) #9

Starting with 5.x, Elasticsearch no longer uses the index name and cluster name as directory names to store indices on disk; that's why you are seeing these hash-like directory names in /var/lib/elasticsearch.
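As a sketch of that on-disk layout: each shard's translog lives at nodes/<node>/indices/<index-uuid>/<shard>/translog, so the candidate directories can be enumerated with find. The mock layout below (including the example UUID) is only there to make the snippet self-contained; on a real node, point ES_DATA at your actual path.data, e.g. /var/lib/elasticsearch.

```shell
# Enumerate per-shard translog directories under the data path.
# If ES_DATA is unset, build a throwaway mock layout for illustration;
# on a real node, set ES_DATA to your path.data (e.g. /var/lib/elasticsearch).
if [ -z "${ES_DATA:-}" ]; then
    ES_DATA=$(mktemp -d)
    mkdir -p "$ES_DATA/nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog"  # mock shard
fi

# List every candidate translog directory:
find "$ES_DATA/nodes/0/indices" -type d -name translog
```

On a live data path this prints one translog directory per shard; the hash-like path component is the index UUID, not the index name.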

The exception you have posted does not indicate any corruption to me; what made you conclude this? Have you tried simply restarting Elasticsearch?

Can you attach the whole elasticsearch log here?


#10

Yes, I have tried restarting ES. After a while, it prints this error in the error log and terminates. Googling around suggested that this error occurs when the translog is corrupted.

The full log is 170 KB; I cannot paste it in a message here, as there is a restriction on how large a message can be. I've put it on Pastebin instead.

I am not worried about lost data; I have the original logs from which the data in the database came, and will feed them to Logstash manually, if necessary. The only thing I'll lose is time. But if the database gets corrupted every time there is a power failure and my only recourse is to delete it and re-create it from scratch, that's simply not acceptable.


(Thiago Souza) #11

I saw your first post on my mobile, so I guess I missed part of the exception message that you initially posted. I have now seen the full exception message and I agree that the translog is corrupted. I am sorry.

It is very unlikely that the translog gets corrupted due to power failure. Are you using RAID for your disks?

Lastly, a failed recovery should not block a node from starting up. The node should be able to start up with the failed indices unavailable. The last lines of the log seem to indicate that the node was stopped externally. Did you stop the service?


#12

No. It's a VirtualBox VM running Linux Mint on a Windows 10 host.

In this case, yes, because the log kept getting filled with this message. Also, Kibana says that the Elasticsearch plugin is red, and Logstash's log keeps getting filled with errors saying that it couldn't connect to the Elasticsearch instance. For all practical purposes, ES isn't working.

So, you are saying that my only recourse is to delete the whole /var/lib/elasticsearch/nodes/ tree, re-create the database from scratch with the original data and hope that it doesn't happen again?


(Thiago Souza) #13

I don't know if there is a way to recover without deleting. Maybe @jasontedor has more to say.

Also, you really shouldn't be running a production cluster in such an environment, since we don't test with that configuration. I would say that the root cause of your corruption lies in your environment and not in Elasticsearch.

Also, Windows 10 is not a supported OS, neither as host nor as guest. See our Support Matrix for what is supported.

I highly recommend moving away from this environment, as you may hit the same problem again, and it won't be Elasticsearch's fault.


(Jason Tedor) #14

You can use the translog tool to truncate the corrupted translog.


#15
  1. It's just a research project, nothing mission-critical.

  2. The Win10 host is for the Linux Mint virtual machine; not for ELK. ELK is running in Linux on that VM.

  3. Linux Mint is basically Ubuntu 16.04, which seems to be supported, according to your matrix. Or are you saying that Elasticsearch is not able to work reliably on a VM at all? Aren't most of the cloud instances out there just different VMs? In any case, I can't afford to dedicate a separate physical Linux machine to this project (and one with lots of RAM, too, because ES really doesn't like environments with less than 4 GB of RAM).


#16

I guess you didn't bother reading the whole thread. I've already read that article and it is useless, because it uses some "magic" string (P45vf_YQRhqjfwLMUvSqDw), which it totally fails to explain how to get from the error message or anywhere else.


(Jason Tedor) #17

This is not a good way to respond to someone who was only trying to help you, especially on a community forum for an open-source project, but frankly anywhere. You establish yourself as an adversary, and you reduce the likelihood that someone else will wade in and try to collaborate with you on finding a solution.

Here's a suggestion for a better response for you in the future: "okay, how do I find the path to apply that to?"

I see no indication in this thread that you read that article. Your assertion that it is useless is a stretch: if you knew what path to apply the tool to, then you could use the documentation on that page to get out of this mess. Therefore, I agree with you that the only challenge here is finding that path.

Let me try to help you with that. You have the index name. You can hit /_cat/indices?v. This will give you a response like:

health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2017.04.09 qmDPqIHJTlmc-b9CB-gHSQ   5   1          0            0       985b           985b

That uuid is what you're looking for to get the path on disk. So in this case it would be nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/ relative to your data path.

Next, from shard id [[logstash-2017.04.09][0]] we know that this is shard 0. Thus, the full path would be nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog, so I would run:

$ bin/elasticsearch-translog truncate -d data/nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog/
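For clarity, that path construction can be sketched as a tiny helper (translog_path is a hypothetical name used here for illustration; it is not part of Elasticsearch):

```shell
# Hypothetical helper: assemble the translog path from the three pieces
# identified above (data path, index uuid from _cat/indices, shard id).
translog_path() {
    data_path="$1"; uuid="$2"; shard="$3"
    printf '%s/nodes/0/indices/%s/%s/translog\n' "$data_path" "$uuid" "$shard"
}

# The example from this thread:
translog_path /var/lib/elasticsearch qmDPqIHJTlmc-b9CB-gHSQ 0
# -> /var/lib/elasticsearch/nodes/0/indices/qmDPqIHJTlmc-b9CB-gHSQ/0/translog
```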

We are going to improve this, so that it's easier to find the path. You can follow this on #24929.

Does that help?


(Damien Bell) #18

As someone who has done support for a number of years: this is a wonderful, diplomatic answer. Well done.


(system) #19

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.