Hi
I am using Elasticsearch 1.5.2 and ingesting data from Samza through the elasticsearch-hadoop connector over the REST API. The number of replicas is set to 1 (two copies in total).
One machine started hitting file system errors such as the following:
[2016-03-18 14:23:06,584][WARN ][cluster.action.shard ] [machine1] [index1][5] received shard failed for [index1][5], node[5uEr4HffS0ihOYizoDDU3w], [R], s[STARTED], indexUUID [YOUAlFTgSNWGck2ikaxtag], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[machine1][inet[/ip1:9300]][indices:data/write/bulk[s][r]]]; nested: IndexFailedEngineException[[index1][5] Index failed for [default#93D3AAB5-93A8-40A1-B03A-004C3B14B0EA]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index1/5/index/_3082.fdx (Read-only file system)]; ]]
Let's say I have four indices: index1, index2, index3, and index4. The error above was logged only for index1 and index2, but ingestion into index3 and index4 stopped completely with the following errors:
2016-03-18 14:54:03 EsPublisher [ERROR] Exception on flushing the writer: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [IP2:9200] returned Internal Server Error(500) - [IndexFailedEngineException[[index3][0] Index failed for [default#a93652de-de98-4400-a3b1-d04270609f67]]; nested: FileNotFoundException[/mnt/elasticsearch/data/cluster1/nodes/0/indices/index3/0/index/_39da.fdx (Read-only file system)]; ]; Bailing out..
When we excluded shard allocation for this machine, the ingestion came back to normal.
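For reference, the exclusion can be expressed as a transient cluster settings update. The snippet below is only a sketch: it uses the `cluster.routing.allocation.exclude._ip` setting, and the helper name, the URL, and the IP address are placeholders rather than our real values.

```python
import json
import urllib.request


def build_exclusion_request(es_url, node_ip):
    """Build (but do not send) a cluster settings update that excludes node_ip
    from shard allocation. es_url and node_ip are placeholder inputs."""
    body = {"transient": {"cluster.routing.allocation.exclude._ip": node_ip}}
    return urllib.request.Request(
        url=es_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_exclusion_request("http://localhost:9200", "10.0.0.1")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply it: urllib.request.urlopen(req)
```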
- How can we monitor file system health from inside Elasticsearch? The Elasticsearch process kept running and the cluster health status stayed green. The problem was first detected on the ingestion side, and our disk I/O checker only fired some time later.
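To make the question concrete, this is roughly the kind of check I would like to have running close to Elasticsearch. It is an external write probe, not an Elasticsearch feature, and the function name is hypothetical: it simply tries to create, sync, and remove a small file under the data path.

```python
import os
import tempfile


def is_writable(data_path):
    """Return True if a small probe file can be created and synced under
    data_path. A read-only or missing path raises OSError and yields False."""
    try:
        # The probe file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=data_path, prefix=".write_probe_") as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False
```

For example, `is_writable("/mnt/elasticsearch/data")` against the data path from the logs above should return False once the file system is remounted read-only, assuming the user running the probe normally has write access there.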
- How can I make Elasticsearch resilient to file system errors like this? Until the machine was excluded from shard allocation, ingestion was completely stopped. If I increase the number of replicas to 2 (three copies in total) and set the write consistency to quorum, will that be enough to keep ingestion running?
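My understanding of the quorum arithmetic in 1.x is sketched below. This is an assumption on my side, not something I have verified: quorum is taken as (total copies / 2) + 1 and is only enforced when there are more than two copies, so with a single replica one active copy is enough. The function name is hypothetical.

```python
def required_active_copies(number_of_replicas):
    """Active shard copies needed for a quorum write, under the assumed 1.x rule:
    (total copies // 2) + 1, enforced only when there are more than two copies."""
    total_copies = number_of_replicas + 1  # primary plus replicas
    if total_copies > 2:
        return total_copies // 2 + 1
    return 1


if __name__ == "__main__":
    for replicas in (1, 2):
        print(replicas, "replica(s):", required_active_copies(replicas), "active copies required")
```

If that is right, two replicas would require 2 of the 3 copies to be active. What I cannot tell is whether a shard sitting on a read-only file system still counts as active for this check.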
Thank you
Best, Jae