How to make ES cluster resilient to FileSystemException

Hi

I am using Elasticsearch 1.5.2 and ingesting data from Samza through the elasticsearch-hadoop connector, using the REST API. I set the replication factor to 1 (two copies in total).
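For clarity, by "replication factor 1" I mean the index-level replica setting, roughly like this sketch (index1 and the local endpoint are just examples):

    # One replica per primary shard = two copies of each shard in total
    curl -XPUT 'http://localhost:9200/index1/_settings' -d '{
      "index": { "number_of_replicas": 1 }
    }'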

Somehow, one machine hit a file system error such as the following:

[2016-03-18 14:23:06,584][WARN ][cluster.action.shard ] [machine1] [index1][5] received shard failed for [index1][5], node[5uEr4HffS0ihOYizoDDU3w], [R], s[STARTED], indexUUID [YOUAlFTgSNWGck2ikaxtag], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[machine1][inet[/ip1:9300]][indices:data/write/bulk[s][r]]]; nested: IndexFailedEngineException[[index1][5] Index failed for [default#93D3AAB5-93A8-40A1-B03A-004C3B14B0EA]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index1/5/index/_3082.fdx (Read-only file system)]; ]]

Let's say I have four indices: index1, index2, index3, and index4. The above error messages were thrown only for index1 and index2, but ingestion for index3 and index4 stopped completely with the following error messages:

2016-03-18 14:54:03 EsPublisher [ERROR] Exception on flushing the writer: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..

When we excluded this machine from shard allocation, ingestion returned to normal.
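(The exclusion was done with the cluster allocation filtering settings; a minimal sketch, with ip1 as the placeholder from the logs above:)

    # Move shards off the faulty node by excluding its IP from allocation
    # (transient, so it is cleared on a full cluster restart)
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.exclude._ip": "ip1" }
    }'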

  1. How can we monitor file system behavior from within Elasticsearch? The process was running OK and the cluster health status was green. The problem was detected on the ingestion side, and the disk I/O checker itself was only triggered a while later. (See the sketch after these questions.)

  2. How can I make Elasticsearch resilient to FileSystemException? Until the machine was excluded from shard allocation, ingestion stopped completely. If I increase the replication factor to 2 (three copies in total) and set the write consistency to quorum, will everything be OK?
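To make the questions concrete, these are the sorts of API-level checks and request settings I have in mind (sketches only; the default host/port and the "default" type from our logs are assumed):

    # Question 1: per-node data path and disk stats as Elasticsearch reports them
    curl -s 'http://localhost:9200/_nodes/stats/fs?pretty'

    # Quick per-node disk usage and shard count overview
    curl -s 'http://localhost:9200/_cat/allocation?v'

    # Question 2: requiring a quorum of shard copies for a write (1.x consistency parameter)
    curl -XPUT 'http://localhost:9200/index1/default/1?consistency=quorum' -d '{ "field": "value" }'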

Thank you
Best, Jae

What sort of filesystem are you using here?

Hi Mark, I am using ext4 on RAID0.

You should really upgrade; there were known issues around corruption in earlier versions that are fixed now.

Yes, I will upgrade to 1.7.5 right away. May I know what those issues were? I will look through the release notes anyway.

Hi, I got similar errors on ES 2.1.0, on ext4 on RAID6 (the OS silently remounted the file system as read-only due to problems with the fibre channel link to an external disk enclosure). Both the primary and the replica of a daily index were completely lost. ES went red only after a cluster restart.

Sadly, this problem is not fixed even in version 2. Thanks, rusty.

How would you expect an application to handle the OS remounting a disk like this?

I expected some kind of automatic handling of unrecoverable errors, such as excluding the faulty machine from shard allocation. I am also wondering whether an external administration tool such as Marvel could do that.
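Something like a small per-node watchdog is what I had in mind; a rough sketch only, where the mount point, IP lookup, and endpoint are illustrative assumptions:

    #!/bin/bash
    # Hypothetical watchdog: if the Elasticsearch data filesystem has been
    # remounted read-only, exclude this node's IP from shard allocation.
    DATA_MOUNT=/mnt/elasticsearch        # assumed mount point of the ES data disk
    NODE_IP=$(hostname -i)               # assumed to resolve to this node's publish address
    ES=http://localhost:9200             # any reachable HTTP endpoint of the cluster

    # /proc/mounts fields: device mountpoint fstype options dump pass -> look for the "ro" option
    if awk -v m="$DATA_MOUNT" '$2 == m && $4 ~ /(^|,)ro(,|$)/' /proc/mounts | grep -q .; then
      curl -s -XPUT "$ES/_cluster/settings" -d "{
        \"transient\": { \"cluster.routing.allocation.exclude._ip\": \"$NODE_IP\" }
      }"
    fi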

Since Elasticsearch cannot show the reason for the failure in the cluster status, the only option is to exclude that node from the cluster (a node in this state cannot start with such an error either, so at least that is consistent). On that node, disable all write operations, check until the data path becomes read/write again, wait for a timeout (~5 minutes), and then try to rejoin the cluster.
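Roughly, the per-node recovery part of that could look like the following sketch (the data path is taken from the logs above; the probe file, timeout, and endpoint are assumptions):

    #!/bin/bash
    # Hypothetical recovery check on the affected node: wait until the data path
    # is writable again, then clear the allocation exclusion so shards can come back.
    DATA_PATH=/mnt/elasticsearch/data    # assumed data path, as seen in the logs
    ES=http://localhost:9200

    until touch "$DATA_PATH/.rw_probe" 2>/dev/null; do
      sleep 300                          # re-check roughly every 5 minutes
    done
    rm -f "$DATA_PATH/.rw_probe"

    # An empty value clears the exclusion filter set earlier
    curl -s -XPUT "$ES/_cluster/settings" -d '{
      "transient": { "cluster.routing.allocation.exclude._ip": "" }
    }'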