We are using Elasticsearch 2.4.0 on WebSphere 8.5.x as an embedded node.
The server starts up fine, and indexing begins when triggered, but it halts after a while with a checksum failure and the shards become unavailable:
2016-10-05 12:04:47,338 [EAD1B][fsync][T#1]] ( elasticsearch.index.translog) DEBUG - [583FF1641A23B743E304B8E32E8EAD1B] [sample] translog closed
2016-10-05 12:04:47,338 [EAD1B][refresh][T#1]] ( elasticsearch.index.engine) DEBUG - [583FF1641A23B743E304B8E32E8EAD1B] [sample] engine closed [engine failed on: [refresh failed]]
2016-10-05 12:04:47,338 [EAD1B][refresh][T#1]] ( elasticsearch.index.engine) WARN - [583FF1641A23B743E304B8E32E8EAD1B] [sample] failed engine [refresh failed]
java.lang.IllegalStateException: Illegal CRC-32 checksum: 3388679863 (resource=FSIndexOutput(path="/test/lbcell9/WASL059/temp/TestIndex/4e4f8e1769ed3a8177c379a7e681b77d/nodes/0/indices/sample/0/index/_1m.cfs"))
Note that we were able to run Elasticsearch 1.0.2 on z/OS without problems.
Also, when I look at the Elasticsearch source code (the JVM bootstrap checks), IBM SDK 2.8 is supported (though it is not mentioned in the support matrix).
Admittedly, Lucene does not see much testing on z/OS, but endianness should not be a problem: Java "ensures" this for us across platforms.
That said, this exception is spooky.
The CRC32.getValue() method returns an unsigned int as a Java long, so the top 32 bits should be 0, which is exactly what that if statement checks. What's odd is that the value in your exception (3388679863) does in fact have all 0s in its top 32 bits, so I don't understand why the if was triggered. It's as if the JVM incorrectly evaluated the if condition as true when it's actually false.
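For reference, the guard in question can be reproduced with plain java.util.zip (this is a sketch mirroring the shape of the check, not the actual Lucene source):

```java
import java.util.zip.CRC32;

public class CrcCheck {
    // CRC32.getValue() returns the unsigned 32-bit checksum widened to a
    // long, so on a conforming JVM the top 32 bits are always zero and
    // this guard should never fire.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long value = crc.getValue();
        if ((value & 0xFFFFFFFF00000000L) != 0) {
            throw new IllegalStateException("Illegal CRC-32 checksum: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // The value from the exception above: its top 32 bits are zero,
        // so the guard would not fire on a correct JVM.
        System.out.println(3388679863L & 0xFFFFFFFF00000000L); // prints 0
        System.out.println(checksum("sample".getBytes()));
    }
}
```

You can verify for yourself that 3388679863 fits in 32 bits, which is why the thrown exception points at the JVM rather than the data.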
Thank you Jörg and Michael for your inputs.
I have additional observations regarding this issue.
This is a bulk indexing job that we trigger; it sends a bulk request for every 100 documents.
We specified a refresh_interval of 5s.
The exception appears when the scheduled refresh is triggered, and everything fails from then on.
However, when we explicitly send a refresh followed by an optimize request, that does not seem to fail, which is where I am lost: the refresh request is the same whether it is scheduled or explicit.
When we disabled the refresh_interval (-1) for the bulk indexing request, it worked without any issues. In that case, refresh and optimize are done at the very end (after all documents are indexed).
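For anyone following along, the working sequence corresponds to the index settings below (a sketch, assuming the usual `PUT /sample/_settings` endpoint, with `sample` being the index name from the logs). Before the bulk load, disable the scheduled refresh:

```json
{ "index": { "refresh_interval": "-1" } }
```

Then, after all documents are indexed and the explicit refresh/optimize has run, restore the interval:

```json
{ "index": { "refresh_interval": "5s" } }
```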
Any ideas on this odd behavior?
As much as I'd like to see the spooky bug solved, I really don't recommend you try and find a workaround. You are running Elasticsearch in an unsupported way (embedded in another JVM process) on an unsupported OS (z/OS) with an unsupported JVM (J9). If you spend enough time on this I bet you could make it stable but as soon as you upgrade, even to a patch release, this could break again. Because unsupported means untested.