Shard failure on Elasticsearch / Docker / Windows

I have a (development) single-node Elasticsearch server under Windows 10 / Docker. Creating the server and an index completes successfully. After filling the index with about 1.7 million documents, the shard fails, leaving the index in red status. The number of documents at which this happens seems to be constant on every try with the same parameters.

{"log":"[2018-12-29T18:04:31,279][WARN ][o.e.i.e.Engine           ] [elasticsearch01] [epg_v21][0] failed to rollback writer on close\n","stream":"stdout","time":"2018-12-29T18:04:31.2819642Z"}
{"log":"java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index/_4o.cfs\n","stream":"stdout","time":"2018-12-29T18:04:31.2820693Z"}
{"log":"\u0009at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820818Z"}
{"log":"\u0009at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820904Z"}
{"log":"\u0009at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820977Z"}
{"log":"\u0009at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:245) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821048Z"}
{"log":"\u0009at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821222Z"}
{"log":"\u0009at java.nio.file.Files.delete(Files.java:1141) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821298Z"}
{"log":"\u0009at org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:371) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821359Z"}
{"log":"\u0009at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:340) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821425Z"}
{"log":"\u0009at org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:63) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821493Z"}
{"log":"\u0009at org.elasticsearch.index.store.ByteSizeCachingDirectory.deleteFile(ByteSizeCachingDirectory.java:175) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821562Z"}
{"log":"\u0009at org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:63) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.282163Z"}
{"log":"\u0009at org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:733) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821703Z"}
{"log":"\u0009at org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:738) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821764Z"}
{"log":"\u0009at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821833Z"}

...

{"log":"\u0009at java.lang.Thread.run(Thread.java:834) [?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.3199944Z"}
{"log":"[2018-12-29T18:04:31,319][INFO ][o.e.c.r.a.AllocationService] [elasticsearch01] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[epg_v21][0]] ...]).\n","stream":"stdout","time":"2018-12-29T18:04:31.3293225Z"}
{"log":"[2018-12-29T18:07:50,808][WARN ][o.e.i.e.Engine           ] [elasticsearch01] [epg_v21][0] failed to rollback writer on close\n","stream":"stdout","time":"2018-12-29T18:07:50.810894Z"}
{"log":"java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index/_7.cfs\n","stream":"stdout","time":"2018-12-29T18:07:50.8109745Z"}

Looking at the index (_cat/indices) while filling it, the document count first increases, but after this error the count is reset to 0 and the index appears to be corrupted. The index state remains red after this error. Imports with a smaller number of documents complete successfully.
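
Roughly how I check this while the import runs (just a sketch; host and port as mapped in the compose file below, epg_v21 is the index name from the log):

# Index health and document count during the import
curl -s 'http://localhost:9200/_cat/indices/epg_v21?v&h=health,status,index,docs.count,store.size'

# Shard state and allocation explanation once the index has gone red
curl -s 'http://localhost:9200/_cat/shards/epg_v21?v'
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'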

I use a Docker volume mount to make the indices persistent. That seems to work after creation, but fails after the above error.

When I look at the files created under Windows, I can see that the index was first being filled (judging by its size), but the file the log complains about has disappeared. A file created on the Windows side also shows up in the container, so the mount still exists. At that moment the same thing happens to the monitoring and Kibana indices.
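
To double-check the mount from both sides I do something like this (a sketch; the container path and index UUID come from the log above, the Windows path from the bind mount in the compose file below, seen here through WSL):

# From inside the container: the segment files Lucene is working with
docker exec elasticsearch01 ls -l /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index

# From the Windows side (via the WSL bash prompt): the same directory through the bind mount
ls -l /mnt/c/elasticsearch/lib/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index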

I use the following docker-compose.yml to start:

version: "2"

services:
    elasticsearch01:
        image: docker.elastic.co/elasticsearch/elasticsearch:6.5.1
        container_name: elasticsearch01
        ports:
            - "9200:9200"
            - "9300:9300"
        volumes:
            - /host_mnt/c/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
            - /host_mnt/c/elasticsearch/config/log4j2.properties:/usr/share/elasticsearch/config/log4j2.properties
            - /host_mnt/c/elasticsearch/lib:/usr/share/elasticsearch/data
            - /host_mnt/c/elasticsearch/log:/usr/share/elasticsearch/logs
        environment:
            - node.name=elasticsearch01
        networks:
            - esnet

networks:
  esnet:
    driver: bridge

I first tried starting docker-compose from my Windows WSL bash prompt (using the mount points from within WSL), but this has the same effect.

I use elasticdump to do the bulk imports. I tried several batch sizes, but with the same result. Somehow this problem arises after about 1,700,000 documents, regardless of how I perform the import (splitting it up, small chunks, chunks in random order).
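
The import command looks roughly like this (a sketch; dump.json and the --limit value are only examples, I varied the batch size between runs):

# Bulk import into the index; --limit sets the number of documents per batch
elasticdump \
  --input=dump.json \
  --output=http://localhost:9200/epg_v21 \
  --type=data \
  --limit=1000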

Other things I tried include enlarging the memory for the Docker containers, etc.

Disk space does not seem to be the problem; I also set cluster.routing.allocation.disk.threshold_enabled to false to disable any checks.
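
This is roughly how I disabled the disk watermark checks (a transient cluster setting, so it is lost on restart):

curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'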

Is anyone able to help me resolve this index corruption?

Try this to eliminate the red status and bring the index back to green:

PUT my_index/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

Thanks, I already did this. My development server only has primary shards. I tried both with replicas and without.

My inexperienced guess is that there is a brief moment in which the Docker volume mount is not available and writing to that primary shard is not possible. It also corrupts the other (Kibana and X-Pack monitoring) indices. But then I would expect to see some logging about this instability for the mounts, either at the Docker or the OS level.
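
Roughly what I plan to watch in parallel with the import, to catch such a moment (a sketch):

# Docker daemon events for the container (restarts, OOM kills, volume problems would show up here)
docker events --filter container=elasticsearch01

# In a second terminal: follow the container output, where the log lines above come from
docker logs -f elasticsearch01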

I suspect filesystem issues. Elasticsearch is rather picky about its filesystem behaving correctly when subject to concurrent access, and the setup you describe has quite a stack of abstractions on top of the actual disk, some of which are quite new. In particular there seems to be something trying to emulate a Unix-like filesystem on top of a Windows-like one, and this sounds very difficult to get right given how different their semantics can be. Not to say that there's definitely an issue in that layer, just that this is the first place I'd look.

Why not run Elasticsearch directly within Windows? That way it'd know it has an NTFS filesystem to play with and would behave appropriately.

You are probably right about the instability of all these layers, and your explanation makes sense to me. The advantages of a container-based installation for development purposes are clear, I think. I had an installation of Elasticsearch under the Windows Subsystem for Linux which did work without problems.

I'm just trying to find a solution with a Docker setup. The fact that this problem occurs after a fixed, consistent number of documents should help me find a clue, or at least some logging other than Elasticsearch's own that might help me resolve this, maybe after enabling some extra log levels.

I will fall back to a solution like you propose, but it is not my first option.
