How to make ES cluster resilient to FileSystemException


(Jae) #1

Hi

I am using Elasticsearch 1.5.2 and ingesting data from Samza through the elasticsearch-hadoop connector using the REST API. I set the replication factor to 1 (two copies in total).

Somehow, one machine got a file system error such as the following:

[2016-03-18 14:23:06,584][WARN ][cluster.action.shard ] [machine1] [index1][5] received shard failed for [index1][5], node[5uEr4HffS0ihOYizoDDU3w], [R], s[STARTED], indexUUID [YOUAlFTgSNWGck2ikaxtag], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[machine1][inet[/ip1:9300]][indices:data/write/bulk[s][r]]]; nested: IndexFailedEngineException[[index1][5] Index failed for [default#93D3AAB5-93A8-40A1-B03A-004C3B14B0EA]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index1/5/index/_3082.fdx (Read-only file system)]; ]]

Let's say I have four indices: index1, 2, 3, 4. The above error messages were thrown only for index1 and index2, but ingestion for index3 and index4 stopped completely with the following error messages:

2016-03-18 14:54:03 EsPublisher [ERROR] Exception on flushing the writer: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..

When we excluded this machine from shard allocation, ingestion returned to normal.
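For reference, the manual exclusion described above can be done through the cluster settings API. A sketch of the settings fragment (the address `ip1` is a placeholder for the faulty machine; not tested against a live cluster here):

```shell
# Exclude the faulty node from shard allocation so its shards
# are rebuilt elsewhere; "ip1" is a placeholder address.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "ip1"
  }
}'
```

Once the machine is repaired, setting the value back to an empty string clears the exclusion.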

  1. How can we monitor file system behavior from inside Elasticsearch? The process was running fine and the cluster health status was green. The problem was detected on the ingestion side, and the disk I/O checker itself was only triggered after a while.

  2. How can I make Elasticsearch resilient to FileSystemException? Until the machine was excluded from shard allocation, ingestion stopped completely. If I increase the replication factor to 2 (three copies in total) and set the write consistency to quorum, will everything be OK?
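On question 1: since the cluster stayed green while the data path was read-only, one option is an external probe that tries a real write. A minimal sketch, assuming you point it at the node's data directory (`ES_DATA_PATH` is a hypothetical variable, not an Elasticsearch setting):

```shell
# check_writable: probe whether a directory still accepts writes.
# This catches the "Read-only file system" case before ingestion fails.
check_writable() {
  probe="$1/.rw_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"        # clean up the probe file
    echo "writable"
  else
    echo "read-only"
  fi
}

# Point ES_DATA_PATH at the real data directory; /tmp is just a fallback.
check_writable "${ES_DATA_PATH:-/tmp}"
```

Run from cron or your monitoring agent, alerting whenever it prints `read-only`.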

Thank you
Best, Jae


(Mark Walkom) #2

What sort of filesystem are you using here?


(Jae) #3

Hi Mark, I am using RAID0, ext4.


(Mark Walkom) #4

You should really upgrade, there were known issues around corruption in earlier versions that are fixed now.


(Jae) #5

Yes, I will upgrade to 1.7.5 right away. May I know what issues there were? I will look through the release notes anyway.


#6

Hi, I got similar errors on ES 2.1.0, on ext4 over RAID6 (the OS silently remounted the filesystem read-only due to problems with the fibre channel link to an external disk enclosure). Both the primary and the replica of a daily index were completely lost. ES only went red after a cluster restart.


(Jae) #7

Sadly, this problem is not fixed even in version 2. Thanks, rusty.


(Mark Walkom) #8

How would you expect an application to handle the OS remounting a disk like this?


(Jae) #9

I expected some automatic handling of unrecoverable errors, such as excluding the faulty machine from shard allocation. I am wondering whether an external administration tool such as Marvel can do that.


#10

Since Elasticsearch cannot show the reason for the failure in the cluster status, the only way is to exclude that node from the cluster (a node cannot start with such an error either, so this stays consistent). On that node, disable all write operations, check until the data path becomes read/write again, wait out a timeout (~5 min), and then try to rejoin the cluster.
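The wait-and-rejoin step above could be sketched as a loop on the affected node; the data path, the 5-minute interval, and the endpoint are all assumptions to adapt:

```shell
# Poll the data path until the OS has remounted it read/write,
# then clear the allocation exclusion so the node can rejoin.
DATA_PATH=/mnt/elasticsearch/data   # placeholder; use your path.data
until touch "$DATA_PATH/.rw_probe" 2>/dev/null; do
  sleep 300   # ~5 min between checks, as suggested above
done
rm -f "$DATA_PATH/.rw_probe"
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {"cluster.routing.allocation.exclude._ip": ""}
}'
```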


(system) #11