I have a (development) single-node Elasticsearch server running under Windows 10 with Docker. Creating the server and an index completes successfully, but after filling the index with about 1.7 million documents the shard fails, leaving the index in a red status. The number of documents at which this happens is roughly the same on every attempt with the same parameters:
{"log":"[2018-12-29T18:04:31,279][WARN ][o.e.i.e.Engine ] [elasticsearch01] [epg_v21][0] failed to rollback writer on close\n","stream":"stdout","time":"2018-12-29T18:04:31.2819642Z"}
{"log":"java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index/_4o.cfs\n","stream":"stdout","time":"2018-12-29T18:04:31.2820693Z"}
{"log":"\u0009at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820818Z"}
{"log":"\u0009at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820904Z"}
{"log":"\u0009at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2820977Z"}
{"log":"\u0009at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:245) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821048Z"}
{"log":"\u0009at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821222Z"}
{"log":"\u0009at java.nio.file.Files.delete(Files.java:1141) ~[?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821298Z"}
{"log":"\u0009at org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:371) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821359Z"}
{"log":"\u0009at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:340) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821425Z"}
{"log":"\u0009at org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:63) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821493Z"}
{"log":"\u0009at org.elasticsearch.index.store.ByteSizeCachingDirectory.deleteFile(ByteSizeCachingDirectory.java:175) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821562Z"}
{"log":"\u0009at org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:63) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.282163Z"}
{"log":"\u0009at org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:733) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821703Z"}
{"log":"\u0009at org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:738) ~[elasticsearch-6.5.1.jar:6.5.1]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821764Z"}
{"log":"\u0009at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]\n","stream":"stdout","time":"2018-12-29T18:04:31.2821833Z"}
...
{"log":"\u0009at java.lang.Thread.run(Thread.java:834) [?:?]\n","stream":"stdout","time":"2018-12-29T18:04:31.3199944Z"}
{"log":"[2018-12-29T18:04:31,319][INFO ][o.e.c.r.a.AllocationService] [elasticsearch01] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[epg_v21][0]] ...]).\n","stream":"stdout","time":"2018-12-29T18:04:31.3293225Z"}
{"log":"[2018-12-29T18:07:50,808][WARN ][o.e.i.e.Engine ] [elasticsearch01] [epg_v21][0] failed to rollback writer on close\n","stream":"stdout","time":"2018-12-29T18:07:50.810894Z"}
{"log":"java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index/_7.cfs\n","stream":"stdout","time":"2018-12-29T18:07:50.8109745Z"}
Watching the index (via _cat/indices) while it is being filled, the document count first increases, but after this error it is reset to 0 and the index appears to be corrupted. The index state stays red from then on. A smaller number of documents is indexed without any problem.
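For reference, this is roughly how I watch the index while the import runs (index name epg_v21 taken from the log above, port 9200 as exposed in the compose file below):

# Document count and health of the index being filled
curl -s 'http://localhost:9200/_cat/indices/epg_v21?v'

# Overall cluster status (changes to RED when the shard fails)
curl -s 'http://localhost:9200/_cluster/health?pretty'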
I use a Docker volume mount to make the indices persistent. That works fine after creating the index, but fails after the error above.
When I look at the files created under Windows, I can see that the index was being filled (judging by its size), but the file the log complains about has disappeared. A file created on the Windows side still shows up inside the container, so the mount itself is still intact. At that moment the same thing happens to the monitoring and Kibana indices.
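To illustrate, this is roughly how I compare the two sides of the mount (container name and data path taken from the compose file and the log above; marker.txt is just an example file name):

# Inside the container: the segment file Lucene complains about is gone
docker exec elasticsearch01 ls -l /usr/share/elasticsearch/data/nodes/0/indices/VBTjn1OcSvSJq6iEnqmLAg/0/index/

# Mount sanity check: create marker.txt in C:\elasticsearch\lib on the host,
# then confirm it is visible inside the container
docker exec elasticsearch01 ls -l /usr/share/elasticsearch/data/marker.txt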
I use the following docker-compose.yml to start:
version: "2"
services:
elasticsearch01:
image: docker.elastic.co/elasticsearch/elasticsearch:6.5.1
container_name: elasticsearch01
ports:
- "9200:9200"
- "9300:9300"
volumes:
- /host_mnt/c/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
- /host_mnt/c/elasticsearch/config/log4j2.properties:/usr/share/elasticsearch/config/log4j2.properties
- /host_mnt/c/elasticsearch/lib:/usr/share/elasticsearch/data
- /host_mnt/c/elasticsearch/log:/usr/share/elasticsearch/logs
environment:
- node.name=elasticsearch01
networks:
- esnet
networks:
esnet:
driver: bridge
I first tried starting docker-compose from my Windows WSL bash prompt (using the mount points from within WSL), but that has the same effect.
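For completeness, this is roughly how I bring the stack up and follow the output (the JSON lines above come straight from Docker's log output):

# Start the single-node cluster defined in docker-compose.yml
docker-compose up -d

# Follow the container output while the import runs
docker logs -f elasticsearch01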
I use elasticdump to do the bulk imports. I have tried several batch sizes, all with the same result. The problem arises after roughly 1,700,000 documents, regardless of how I perform the import (splitting it up, small chunks, chunks in random order).
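The import command looks roughly like this (data.json is a placeholder for one of my export chunks; --limit is the batch size I varied):

# Bulk import into the index; --limit controls how many documents are sent per batch
elasticdump --input=data.json --output=http://localhost:9200/epg_v21 --type=data --limit=500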
I have also tried other things, such as increasing the memory available to the Docker containers.
Disk space does not seem to be the problem either; I also set cluster.routing.allocation.disk.threshold_enabled to false to rule out the disk watermark checks.
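That setting was applied roughly like this, via the cluster settings API:

# Disable the disk allocation watermark checks (to rule out disk-space issues)
curl -s -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.disk.threshold_enabled": false}}'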
Can anyone help me figure out why my index gets corrupted and how to resolve it?