Elasticsearch on z/OS

bneelima84 · October 7, 2016, 3:48am

Hi,

We are using Elasticsearch 2.4.0 on WebSphere 8.5.x as an embedded node.

The server starts up fine, but once indexing is triggered it runs for a while, then halts with a checksum failure and the shards become unavailable:

2016-10-05 12:04:47,338 [EAD1B][fsync][T#1]] ( elasticsearch.index.translog) DEBUG - [583FF1641A23B743E304B8E32E8EAD1B] [sample][0] translog closed
2016-10-05 12:04:47,338 [EAD1B][refresh][T#1]] ( elasticsearch.index.engine) DEBUG - [583FF1641A23B743E304B8E32E8EAD1B] [sample][0] engine closed [engine failed on: [refresh failed]]
2016-10-05 12:04:47,338 [EAD1B][refresh][T#1]] ( elasticsearch.index.engine) WARN - [583FF1641A23B743E304B8E32E8EAD1B] [sample][0] failed engine [refresh failed]
java.lang.IllegalStateException: Illegal CRC-32 checksum: 3388679863 (resource=FSIndexOutput(path="/test/lbcell9/WASL059/temp/TestIndex/4e4f8e1769ed3a8177c379a7e681b77d/nodes/0/indices/sample/0/index/_1m.cfs"))
at org.apache.lucene.codecs.CodecUtil.writeCRC(CodecUtil.java:475)
at org.apache.lucene.codecs.CodecUtil.writeFooter(CodecUtil.java:309)
at org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.write(Lucene50CompoundFormat.java:103)
at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4659)
at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:492)
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:459)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:503)
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:615)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:424)
at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:286)
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:261)
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:251)
at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:137)
at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:154)
at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:669)
at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:661)
at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1343)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)

Environment details:
os.name : z/OS
os.arch: s390x
os.version: 02.02.00
java.version: 1.8.0
java.fullversion: JRE 1.8.0 IBM J9 2.8 z/OS s390x-64 Compressed References 20160106_284759 (JIT enabled, AOT enabled) J9VM - R28_20160106_1341_B284759 JIT - tr.r14.java_20151209_107110.02 GC - R28_20160106_1341_B284759_CMPRSS J9CL - 20160106_284759

Filesystem: ZFS

Any pointers on why this can happen?

bneelima84 · October 7, 2016, 3:52am

Also, the translog keeps growing and the system runs out of disk space!

Christian_Dahlqvist · October 7, 2016, 4:18am

According to the support matrix, neither z/OS nor the IBM JDK is supported, which may explain why you are experiencing difficulties.

bneelima84 · October 7, 2016, 4:31am

OK, but we were able to work with Elasticsearch 1.0.2 on z/OS.
Also, when I look at the Elasticsearch source code (the bootstrap checks for the JVM), the IBM SDK 2.8 is supported, although it is not mentioned in the matrix.

jprante · October 7, 2016, 7:28am

Can you specify "after a while"? Is it seconds/minutes/hours/days? Does it correspond to the refresh interval?

If it's the refresh interval, I assume the issue is due to the big-endian byte order on z/OS. Lucene's CRC-32 checksum check seems to be correct on little-endian platforms only.

Elasticsearch 1.0.2 "works" because there were no checksums at all and therefore no reliability guarantee for written index segments.

bneelima84 · October 7, 2016, 3:54pm

It's within minutes, say after indexing 20000 items.
Any ideas on how this can be fixed, or any way to disable these checks?

bneelima84 · October 7, 2016, 4:11pm

To be precise, it happens at the refresh (we specified a refresh interval of 5s)

jprante · October 7, 2016, 7:12pm

There is a bug in Lucene: reading and writing CRC-32 index checksums does not take endianness into account.

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java#L543

static long readCRC(IndexInput input) throws IOException {
  long value = input.readLong();
  if ((value & 0xFFFFFFFF00000000L) != 0) {
    throw new CorruptIndexException("Illegal CRC-32 checksum: " + value, input);
  }
  return value;
}


/**
 * Writes CRC32 value as a 64-bit long to the output.
 * @throws IllegalStateException if CRC is formatted incorrectly (wrong bits set)
 * @throws IOException if an i/o error occurs
 */
static void writeCRC(IndexOutput output) throws IOException {
  long value = output.getChecksum();
  if ((value & 0xFFFFFFFF00000000L) != 0) {
    throw new IllegalStateException("Illegal CRC-32 checksum: " + value + " (resource=" + output + ")");
  }
  output.writeLong(value);
}

So there is currently no workaround that I know of.

Maybe a Lucene developer can chime in and fix this, perhaps with the help of java.nio.ByteOrder and System.getProperty("os.arch").

mikemccand · October 7, 2016, 9:49pm

Admittedly, Lucene does not see much testing on z/OS, but endianness should not be a problem: Java "ensures" this for us across platforms.

That said, this exception is spooky.
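
The byte-order point can be checked with a small standalone program. This is only a sketch: the class and method names (`EndianCheck`, `encodeBigEndian`, `toHex`) are made up for illustration and are not Lucene code. It encodes a long high byte first using shifts, in the style of a byte-at-a-time writer, and compares that with a freshly allocated `ByteBuffer`, whose order is always big-endian regardless of the hardware.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianCheck {
    // Encodes a long high byte first using shifts only, so the result
    // cannot depend on the byte order of the underlying hardware.
    static byte[] encodeBigEndian(long value) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (value >>> (56 - 8 * i));
        }
        return out;
    }

    // Renders bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long crc = 3388679863L; // value taken from the exception message
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("shift-based:  " + toHex(encodeBigEndian(crc)));
        // A newly allocated ByteBuffer is BIG_ENDIAN whatever the native order is.
        System.out.println("ByteBuffer:   " + toHex(ByteBuffer.allocate(8).putLong(crc).array()));
    }
}
```

Both hex lines should be identical on s390x and on x86; only the "native order" line is expected to differ between the two.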

The CRC32.getValue() method is returning an unsigned int as a java long, and so the top 32 bits should be 0, as that if statement is checking. What's odd is the value in your exception (3388679863) does in fact have all 0s in the top 32 bits, so I don't understand why the if was triggered. It's as if the JVM incorrectly treated the if condition as true when it's actually false.
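
The arithmetic is easy to confirm outside Lucene. A minimal sketch follows; `CrcMaskCheck` and `hasHighBits` are illustrative names, not part of Lucene. It applies the same mask as the quoted guard to the value from the exception message and to a freshly computed `java.util.zip.CRC32` value.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcMaskCheck {
    static final long HIGH_BITS = 0xFFFFFFFF00000000L;

    // Same condition as the guard in the quoted readCRC/writeCRC methods.
    static boolean hasHighBits(long value) {
        return (value & HIGH_BITS) != 0;
    }

    public static void main(String[] args) {
        long reported = 3388679863L; // value from the exception message
        System.out.println(Long.toHexString(reported)); // c9fb26b7
        System.out.println(hasHighBits(reported));      // false

        // CRC32.getValue() returns the 32-bit checksum in the low half of a long.
        CRC32 crc = new CRC32();
        crc.update("sample".getBytes(StandardCharsets.UTF_8));
        System.out.println(hasHighBits(crc.getValue())); // false
    }
}
```

Since 3388679863 is 0xC9FB26B7, it fits in 32 bits, so on a correctly behaving JVM the guard cannot fire for this value.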

J9 has had bugs that affect Lucene in the past, and it looks like only the very recent (next to be released?) version is able to pass all Lucene tests: https://issues.apache.org/jira/browse/LUCENE-7432

Is it possible to test with that J9 version and see if this exception still happens?

Mike McCandless

bneelima84 · October 8, 2016, 9:13am

Thank you Jörg and Michael for your inputs.
I have additional observations regarding this issue.

This is a bulk indexing job that we trigger; it sends one bulk request for every 100 documents.
We specified a refresh_interval of 5s.
The exception occurs when the scheduled refresh is triggered, and it fails from then on.

However, when we explicitly send a refresh followed by an optimize request, that doesn't seem to fail. This is where I am lost, since a refresh should be the same whether it is scheduled or explicit.

When we disabled the refresh_interval (set it to -1) for the bulk indexing request, it worked without any issues. In this case the refresh and optimize are done at the very end (after all documents are indexed).
Any ideas on this odd behavior?

nik9000 · October 8, 2016, 11:35am

As much as I'd like to see the spooky bug solved, I really don't recommend you try and find a workaround. You are running Elasticsearch in an unsupported way (embedded in another JVM process) on an unsupported OS (z/OS) with an unsupported JVM (J9). If you spend enough time on this I bet you could make it stable but as soon as you upgrade, even to a patch release, this could break again. Because unsupported means untested.

eakst7 · January 4, 2017, 8:34pm

This is a bug in the IBM JVM. It is fixed in the latest release:

http://www-01.ibm.com/support/docview.wss?uid=swg1IV90684
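
One way to compare JVM builds is a standalone probe that runs the same guard in a hot loop, so the JIT gets a chance to compile it. This is a sketch under assumptions: the names `CrcGuardProbe` and `countGuardFailures` are hypothetical, and there is no guarantee that this loop reproduces the failure seen inside the embedded node, which went through Lucene's own output classes. On a correct JVM it should report 0.

```java
import java.util.Random;
import java.util.zip.CRC32;

public class CrcGuardProbe {
    // Runs the guard from the quoted writeCRC many times over random data
    // and counts how often it fires. The expected count is 0.
    static long countGuardFailures(int iterations, long seed) {
        Random random = new Random(seed);
        byte[] buffer = new byte[4096];
        CRC32 crc = new CRC32();
        long failures = 0;
        for (int i = 0; i < iterations; i++) {
            random.nextBytes(buffer);
            crc.reset();
            crc.update(buffer, 0, buffer.length);
            long value = crc.getValue();
            if ((value & 0xFFFFFFFF00000000L) != 0) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        System.out.println("guard failures  = " + countGuardFailures(200_000, 42L));
    }
}
```

A non-zero count on the old JVM and zero on the updated one would support the JVM-bug explanation; zero on both says nothing either way.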

bneelima84 · January 6, 2017, 4:50am

Thanks for your input, Ed!
