Need help with IBM JDK Issues with ES 1.4.5


(Ramkumar Ramalingam) #1

Hello !

Today we tried upgrading our ES instance from 1.4.0 to 1.4.5 and found that it fails to initialize with all versions of the IBM JDK. ES throws the following message at startup:

{1.4.5}: Initialization Failed ...

Since we are very much tied to the IBM JDK for various reasons, we would like to use this thread to figure out how to make our ES version work with it. We also found that the "es.bypass.vm.check" property can help overcome the startup issue with the IBM JDK.
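To make the idea of a bypass property concrete, here is a minimal sketch of how a startup guard with an opt-out switch could look. Only the property name "es.bypass.vm.check" comes from this post; the class, the method names, and the exact semantics are assumptions for illustration, not Elasticsearch's actual code.

```java
import java.util.Properties;

/** Hypothetical sketch of a startup guard with an opt-out property; not Elasticsearch's actual code. */
final class VmCheckSketch {
    // Property name taken from the post; how ES interprets its value is an assumption here.
    static final String BYPASS_PROPERTY = "es.bypass.vm.check";

    /** Returns true when the given properties ask to skip the JVM check. */
    static boolean isBypassed(Properties props) {
        return Boolean.parseBoolean(props.getProperty(BYPASS_PROPERTY, "false"));
    }

    /** Returns true when startup should be refused for this JVM vendor. */
    static boolean shouldRefuseStartup(String jvmVendor, Properties props) {
        if (isBypassed(props)) {
            return false; // the operator explicitly accepted the risk
        }
        return "IBM Corporation".equals(jvmVendor);
    }

    public static void main(String[] args) {
        // Report the decision for the running JVM instead of aborting.
        String vendor = System.getProperty("java.vm.vendor");
        boolean refuse = shouldRefuseStartup(vendor, System.getProperties());
        System.out.println("vendor=" + vendor + " refuseStartup=" + refuse);
    }
}
```

Under this sketch, the switch would be set as an ordinary JVM system property, for example `-Des.bypass.vm.check=true` on the java command line. It only silences the guard; it does not address whatever the guard is protecting against.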

We have many ES instances running well in numerous contexts with the IBM JDK today and we need to be able to continue to take new levels of ES. If there are issues with later drivers of ES on IBM Java then I would like to understand what those are and will enlist the help of the IBM JDK team to investigate these. Is there a list of specific bugs identified for IBM JDK that we can take up with IBM JDK team? The information in the pull that introduced the ES initialization exception (https://github.com/elastic/elasticsearch/pull/7580) is not specific.

Appreciate your help here !!


(Mark Walkom) #2

We do this due to known issues with this JVM in Lucene.
See https://wiki.apache.org/lucene-java/JavaBugs#IBM_J9_Bugs for more information.


#3

Hi,

I added this check, let me give you some background behind it. (this is all from the lucene project perspective)

With the way we test, we find lots of bugs in the jvm. These jenkins failures are time-consuming to look into: if it's a bug like that, we can count on it eating up a couple days (and nights) of a few people's time. Typically we assume the bug is in our own code initially (which is always the most likely case), and then after finally realizing all that time was completely wasted, we are left digging deeper.

Usually these bugs aren't easy to reproduce: the worst ones don't crash, they just generate wrong code and give you bad computations. Additionally these bugs in the dynamic compilation process or garbage collection usually depend on timing. Sometimes the "repro" is to run the entire test suite in a loop with a certain random seed until you get lucky. When you finally get there and have a way to reproduce it, then what to do about it?
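As a rough illustration of the "run it in a loop with a fixed seed" style of repro, here is a minimal sketch. The class and method names are hypothetical and the check is a stand-in for a real test; this is not the Lucene test runner.

```java
import java.util.Random;
import java.util.function.LongPredicate;

/** Minimal sketch of "loop with a fixed seed until it fails"; all names are hypothetical. */
final class ReproLoopSketch {
    /**
     * Runs the check up to maxRuns times with the same seed.
     * Returns the 1-based index of the first failing run, or -1 if every run passed.
     */
    static int firstFailingRun(LongPredicate check, long seed, int maxRuns) {
        for (int run = 1; run <= maxRuns; run++) {
            if (!check.test(seed)) {
                return run;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a real test: two identically seeded generators must agree.
        LongPredicate check = seed -> {
            Random a = new Random(seed);
            Random b = new Random(seed);
            long sumA = 0;
            long sumB = 0;
            for (int i = 0; i < 100_000; i++) {
                sumA += a.nextInt(1000);
                sumB += b.nextInt(1000);
            }
            return sumA == sumB;
        };
        int failed = firstFailingRun(check, 42L, 1000);
        System.out.println(failed < 0 ? "no failure in 1000 runs" : "failed on run " + failed);
    }
}
```

Because the seed is fixed, every run sees identical inputs, so on a correct JVM the check should never fail. A failure that only shows up on some later run points at something timing-dependent underneath, such as dynamic compilation or garbage collection, rather than at the test data.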

Getting the bugs to the right people has gotten much much better with openjdk/oracle. These days the Oracle QA team works closely with us to give us regular snapshot builds, and channels communication to make sure our issues don't get lost: it's much better! But I'm scarred from the past.

Once we get that far, even if someone is looking at it, doesn't mean it will get fixed easily or at all. Look how long it took to make repros on https://issues.apache.org/jira/browse/LUCENE-5212, and then the resulting debugging on https://bugs.openjdk.java.net/browse/JDK-8024830 (e.g. half-megabyte of assembler to look through). This one was additionally tricky because you needed sandy bridge CPU to really trigger it.

Maybe IBM JVM bugs like https://groups.google.com/forum/#!msg/elasticsearch/ZWWxoj8HC7E/X7rWo3zS5ZkJ are equally complicated!

So in order to prevent problems, it's not that we can just give you a list of bugs, because we don't have one. It needs a whole process around a bunch of stuff:

  • Who will take ownership / get things moving / setup jenkins servers / etc
  • Which versions should be tested in continuous builds? 7? 8?
  • Which options should be tested? 32/64bit? Which garbage collectors?
  • Who will look at the test failures and sort through/debug them?
  • How to track bugs against existing bugs already being worked / fixed?
  • How to report new bugs that show up?
  • How to get snapshot builds on a regular basis and find bugs before they make it into releases?

We just don't have anyone working on it for IBM. IBM got slowly worse and worse. At some point there were lots of parameters just disabling compilation to avoid a bunch of methods which would get miscompiled. The large number of failures created a lot of noise and confusion. I remember screaming to just disable the IBM JDK from tests, which seems to have ultimately happened. Sometimes things would seem to get fixed (https://issues.apache.org/jira/browse/LUCENE-4987) but it's not as transparent as the Oracle case; we really don't know what is happening.

Like the rest of open-source, you gotta scratch your own itch. If you really want IBM JDK to work, then grab that IBM JDK and see if you can get the lucene test suite passing several hundred/thousand times in a row. Triage the failures and work to get at least the critical ones fixed.

Even for the Oracle/openjdk case, it's a struggle. I generally won't waste my time unless I think it can cause data corruption. If it's just rare crashes, it's not worth it. We really want to be writing search engines and not QA testing the JVM!


(Jörg Prante) #4

Would love to help, but I can't either. I have IBM Power 550 Express hardware, but only a few resources are available to build a reliable developer software infrastructure for testing Lucene/Elasticsearch on non-x86 hardware. The IBM JVM is a black box with an unknown state of open/fixed errors, making the whole development process unpredictable and fully dependent on IBM. I think this vendor lock-in shows how valuable it is to be dedicated to open source, as OpenJDK is.


(George T Chan) #5

Hi,

I am from the IBM JVM team. I can help here.

I am going to work on setting up multiple Jenkins instances in IBM that will run Lucene tests periodically using IBMJDK 32/64 bit on different platforms. I would like to get this to the point that:

  1. These jenkins instance(s) will report failures directly to the Lucene mailing list.
  2. We will upgrade IBM JVM when new versions are available.
  3. We don't mind being the ones who monitor the builds and sort out bugs, but I think it will be beneficial for the community to be involved too.

On the other questions:
  • Which versions should be tested in continuous builds? 7? 8? I propose that we do the latest version of 7 and 8.
  • Which options should be tested? 32/64bit? Which garbage collectors? I propose we spend time on 64 bit first. I need community input on garbage collectors.

  • How to track bugs against existing bugs already being worked/fixed? I can keep track of the IBM JVM bug list. Need to find a way to share it with the community. Suggestions?
  • How to report new bugs that show up?
    I suggest that new bugs be reported to one of these 2 places with the hash tag #IBMJDK
    https://developer.ibm.com/answers/questions/

In the meantime, if there are suggestions for posting heads-up messages somewhere, I can monitor those too.

  • How to get snapshot builds on a regular basis and find bugs before they make it into releases? Would setting up multiple Jenkins instances and running the Lucene tests periodically address this issue?

I can be the point of contact if there are questions related to IBM JVM.


#6

Hi George, this sounds like a great step, thank you for responding!

Just a few more questions: about the mailing lists, I think reporting failures to the lucene developer list is fine, it's ultimately where they need to go. However it may be noisy at first for several reasons: not just jvm bugs but maybe openjdk assumptions in our code/build somewhere. If build failure messages have a prefix or tag to indicate IBM JDK testing, then nobody can complain, they can just adjust mail filters. I also recommend subscribing the source email address being used by jenkins to the developer list, so that build failures don't have to go through moderation every time.

To help developers, is there an easy way to retrieve different versions so developers can try to reproduce test failures? If they differ from http://www.ibm.com/developerworks/java/jdk/linux/download.html then it's not obvious at the moment how to retrieve them. This is especially important for e.g. any snapshot build or anything like that. In the openjdk case we have already had really good success getting this stuff fixed in snapshot builds so the bugs never get seen by real users!

As far as the IBM JVM bug list, hmm, well I can describe how I deal with it today with openjdk. The openjdk bug tracker is indexed by google, so usually after a few google searches, I can look for fixed/already-reported bugs. Additionally I subscribe to hotspot compiler mailing lists and look at all changes (not just bugs) to have an idea of what is going on.

Thank you again for responding here.


(George T Chan) #7

Hi, thank you for the suggestions.

On the mailing list: I will try to adopt your suggestions. I will reply to this thread at a later date with the tag we will use to indicate IBM SDK testing.

On retrieving different versions: the developerWorks link is the right place.

On the IBM JVM bug list: I will see if we can generate something similar.

In any case, I will post regular progress updates.

Any other ideas, suggestions and/or pointers are most welcome.

thanks again for the reply.


(George T Chan) #8

Hi Robert,

As you might be aware, of the 2 bugs reported in dW, one has a fix. The other also has a fix that is almost ready; it is under code review.

We have also set up a Jenkins server to automatically check out Lucene code and run tests. It is running on Linux only at the moment. We are in the process of checking out the setup. Windows is on its way.

We are moving forward to address some of the issues raised. I wonder if you can advise on the process that we should follow to enable ElasticSearch to support the IBM JDK again, and what, if anything, I can do to speed up this process?


#9

Hey, thanks so much for the quick responses on those bugs. I think we have made great progress for the next release of Java 8.

I already pushed changes so that elasticsearch tests are passing (at least once) with J9. I do not think elasticsearch is impacted by the bugs we found in the lucene test suite, since it does not use that functionality, but I plan to do a little more testing to make sure.

The only concern I have left is https://issues.apache.org/jira/browse/LUCENE-6557, but I am hoping that was caused by a bad RAM chip in my computer. I am stressing those tests again now to see.

Hopefully if everything is good, I will submit a pull request to adjust the bootup logic, something to just fail on only older J9 versions (< 2.8).
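A version-gated check of the kind described in the previous paragraph might be sketched like this. The 2.8 threshold comes from that paragraph; the vendor string, the assumed shape of the version string, and all names are illustrative assumptions, not the code that was actually committed.

```java
/** Hypothetical sketch of a version-gated JVM check; not the code that was actually committed. */
final class J9VersionGateSketch {
    // Threshold mentioned in the post: fail only on J9 versions below 2.8.
    static final float MIN_OK_VERSION = 2.8f;

    /**
     * Returns true only for IBM runtimes whose version string parses to a number below 2.8.
     * Assumption: the VM version string looks like "2.7" or "2.8".
     * Strings that cannot be parsed are not rejected (best effort).
     */
    static boolean isTooOld(String jvmVendor, String vmVersion) {
        if (!"IBM Corporation".equals(jvmVendor) || vmVersion == null) {
            return false;
        }
        try {
            return Float.parseFloat(vmVersion.trim()) < MIN_OK_VERSION;
        } catch (NumberFormatException e) {
            return false; // unknown format: do not block startup
        }
    }

    public static void main(String[] args) {
        String vendor = System.getProperty("java.vm.vendor");
        String version = System.getProperty("java.vm.version");
        System.out.println("vendor=" + vendor + " version=" + version
                + " tooOld=" + isTooOld(vendor, version));
    }
}
```

Treating an unparseable version as acceptable is a deliberate choice in this sketch: a check meant to catch known-bad old runtimes should not lock out a runtime just because its version string has an unexpected format.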

Separately on the lucene side we can adjust our recommendations, to me the most important issues are:

Thanks for being so responsive to the issues!


(George T Chan) #10

Thank you. I will follow up on LUCENE-6522. It looks positive at this point. I will post an update next week. I will also keep an eye on LUCENE-6557. Thanks in advance for submitting a pull request.


(George T Chan) #11

Hi, the fix for the J9 issue reported in LUCENE-6522 is available for download.

The build levels are:

Java 8.0.1.10
Java 7.1.3.10

thanks for being patient with us.


(Ivan Brusic) #12

Great work everyone!

Ivan


(George T Chan) #13

Hi Robert
Are we at a point where it makes sense to submit a pull request to adjust the lucene bootup logic, something to just fail on only older J9 versions (< 2.8)?


#14

I committed such a thing a while back: https://github.com/rmuir/elasticsearch/commit/077b9e0e5859dfecdde2901a07dc741ebde801d8


(Mesbah Alam) #15

Hi Robert,

Thank you for all your help so far. Regarding your code change to ensure installation doesn't fail for 2.8 or above: could you please let us know which version of ES it will be available in, and when? Can we test it using ES 1.7?


(Ramkumar Ramalingam) #16

Great progress on the IBM JDK issues reported in this thread, thanks Robert and George. Appreciate all your efforts here.

With a lot of excitement, we just picked up the latest elasticsearch v1.7.1 and the latest IBM Java 7.1.3.10, assuming that all the issues are now resolved and the appropriate changes are reflected in the code base.

We then found that the initialization issue shown below still occurs.

{1.7.1}: Initialization Failed ...

Please let us know, if we are missing something here. Thanks.


#17

I tried ES v1.7.1 with IBM Java 7 and still got the same error.
I checked the code; there is logic like this that blocked my test:
    } else if ("IBM Corporation".equals(Constants.JVM_VENDOR)) {
        // currently any JVM from IBM will easily result in index corruption.
        StringBuilder sb = new StringBuilder();
        sb.append("IBM runtimes suffer from several bugs which can cause data corruption.");
        sb.append(System.lineSeparator());
        sb.append("Please upgrade the JVM, see ").append(JVM_RECOMMENDATIONS);
        sb.append(" for current recommendations.");
        throw new RuntimeException(sb.toString());
    }
}

Is there a way to work around this?


(Mesbah Alam) #18

Hi Robert,

I tried today with the latest ES 1.7.2 with IBM Java 8 and still got the same error:

C:\elastisearch\elasticsearch-1.7.2\bin>elasticsearch
{1.7.2}: Initialization Failed ...

It appears that the code checked into github (https://github.com/rmuir/elasticsearch/commit/077b9e0e5859dfecdde2901a07dc741ebde801d8) to allow IBM Java 8 to be used by ES has not gone live yet. Is there an estimate of when it will?

Can you please apply the same fix for Java 7 as well? (IBM fixed the issue on both Java 7 and Java 8, as reported above.)


(Mesbah Alam) #20

Verified that the issue has been resolved in Elasticsearch release 2.3.3. Thank you.


(system) closed #21