I am fairly new to Elasticsearch and am very impressed thus far. However, I am running into a problem as I test it to ensure it is something I can run in a production environment. I'm hopeful that someone here might have some insight into my issue.
I'm running ES 2.4 on 64-bit Linux Mint 18 (4 CPUs) using the Oracle JDK 8 (1.8.0_60). I have also tried version 1.8.0_101, with the same results. I'm using the default ES settings, except that I am allocating a 4 GB heap and locking that memory as per the documented guidelines. For now, I'm running 1 node with 4 shards.
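For reference, the non-default settings amount to the following (a sketch using the 2.x setting names — heap via the ES_HEAP_SIZE environment variable, memory locking via elasticsearch.yml; the env-file path is just an example):

```
# environment, e.g. /etc/default/elasticsearch
ES_HEAP_SIZE=4g

# elasticsearch.yml
bootstrap.mlockall: true
```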
The issue I'm seeing is that when I try to index large volumes of documents (i.e. 8 threads inserting 100 documents at a time via the bulk API) through the Java client, I start seeing messages about the index being corrupted and the JVM crashes, bringing down ES. Here are some of the logging details I see when this occurs:
[2016-09-12 10:44:57,386][ERROR][index.engine ] [n1] [index1] failed to merge
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5633a4a2 actual=6fae7389 (resource=BufferedChecks
[2016-09-12 10:44:57,417][WARN ][index.engine ] [n1] [index1] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5633a4a2 actual=6fae7389 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/home/elastic/index1/nodes/0/indices/index1/0/index/_45v
As you can see, the logs indicate possible checksum errors on the drive. However, I'm able to run extremely intensive Java (NIO-based) apps that push this drive far harder than ES does, and I never see this issue in any other application. Also, the drive is only at 30% capacity at this point. Note that when I ran with a smaller load (only 4 publishers), all was fine; I was able to run for days without issue. When I increased that to 8 publishers, it took about 20 minutes before this issue occurred. The only correlation I can make with regard to the number of producers in my test app is that I have 4 CPUs on my test system.
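To make the load pattern concrete, here is a minimal sketch of the publisher setup described above: 8 threads, each submitting batches of 100 documents. The sendBulk() method here is a hypothetical stub standing in for the real Java-client bulk round trip (prepareBulk / execute), since a runnable example can't assume a live cluster; batch count per thread is kept small for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BulkLoadSketch {
    static final int PUBLISHERS = 8;          // publisher threads (the failing configuration)
    static final int BATCH_SIZE = 100;        // documents per bulk request
    static final int BATCHES_PER_THREAD = 5;  // kept small for illustration

    static final AtomicInteger indexed = new AtomicInteger();

    // Stub standing in for client.prepareBulk()...execute().actionGet()
    static void sendBulk(List<String> docs) {
        indexed.addAndGet(docs.size());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PUBLISHERS);
        for (int t = 0; t < PUBLISHERS; t++) {
            pool.submit(() -> {
                for (int b = 0; b < BATCHES_PER_THREAD; b++) {
                    List<String> batch = new ArrayList<>();
                    for (int d = 0; d < BATCH_SIZE; d++) {
                        batch.add("{\"field\":\"value\"}");
                    }
                    sendBulk(batch);  // one bulk request per 100-doc batch
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("indexed=" + indexed.get());
    }
}
```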
So I have a couple of questions:
1.) Is there a way to recover (or at least partially recover) the index when this occurs? Right now I'm just looking for a way to get back as much data as I can so I don't have to re-insert everything.
2.) Has anyone seen this before and, if so, how can I avoid it?