Need help with IBM JDK Issues with ES 1.4.5

rmuir · June 3, 2015, 1:04am

Hi,

I added this check, so let me give you some background on it (this is all from the Lucene project's perspective).
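For readers who haven't seen one, a startup check of this kind can be very small. The following is only a minimal sketch, not the actual check discussed in this thread: the class name, method name, and warning text are hypothetical, and the only thing it relies on is the standard `java.vendor` and `java.version` system properties.

```java
import java.util.Locale;

// Hypothetical illustration only; not the real Elasticsearch/Lucene check.
class JvmVendorCheck {

    // Returns true if the vendor string looks like it comes from an IBM JVM.
    static boolean isIbmVendor(String vendor) {
        return vendor != null && vendor.toUpperCase(Locale.ROOT).contains("IBM");
    }

    public static void main(String[] args) {
        // Standard system properties available on every JVM.
        String vendor = System.getProperty("java.vendor");
        String version = System.getProperty("java.version");

        if (isIbmVendor(vendor)) {
            // A real check might refuse to start here; this sketch only warns.
            System.err.println("Warning: running on " + vendor + " " + version
                    + ", which is not covered by regular testing");
        } else {
            System.out.println("JVM: " + vendor + " " + version);
        }
    }
}
```

Matching on a substring of the vendor string is a deliberately crude choice for the sketch; a real check would more likely key on specific known-bad versions.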

With the way we test, we find lots of bugs in the JVM. These Jenkins failures are time-consuming to look into: if it's a bug like that, we can count on it eating up a couple of days (and nights) of a few people's time. Typically we assume at first that the bug is in our own code (which is always the most likely case), and only after finally realizing all that time was completely wasted are we left digging deeper.

Usually these bugs aren't easy to reproduce: the worst ones don't crash, they just generate wrong code and give you bad computations. Additionally, bugs in the dynamic compilation process or garbage collection usually depend on timing. Sometimes the "repro" is to run the entire test suite in a loop with a certain random seed until you get lucky. And when you finally get there and have a way to reproduce it, what do you do about it?
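To make the "run it in a loop until you get lucky" idea concrete, here is a rough sketch. It is not Lucene's test runner: `ReproLoop`, `firstFailure`, and `workload` are hypothetical names, and the workload is just a stand-in for a real test. The point is that the seed stays fixed while the attempt is repeated, because a miscompilation may only show up once the JIT has kicked in or the timing happens to line up.

```java
import java.util.Random;
import java.util.function.LongPredicate;

// Hypothetical harness for illustration; names are not from any real project.
class ReproLoop {

    // Runs the check with the same seed up to maxAttempts times.
    // Returns the 1-based attempt on which it first failed, or -1 if it never failed.
    static int firstFailure(LongPredicate check, long seed, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!check.test(seed)) {
                return attempt;
            }
        }
        return -1;
    }

    // Stand-in for a real test: a deterministic computation driven by the seed.
    static long workload(long seed) {
        Random random = new Random(seed);
        long acc = 0;
        for (int i = 0; i < 10_000; i++) {
            acc = acc * 31 + random.nextInt(1000);
        }
        return acc;
    }

    public static void main(String[] args) {
        long seed = 42L;
        // Reference result from the first run, before the loop is hot.
        long expected = workload(seed);
        // Any later mismatch means the same code gave a different answer.
        int failedAt = firstFailure(s -> workload(s) == expected, seed, 1000);
        System.out.println(failedAt == -1
                ? "no mismatch in 1000 attempts"
                : "mismatch on attempt " + failedAt);
    }
}
```

On a healthy JVM this should never report a mismatch, since the computation is fully determined by the seed; a mismatch would point at the runtime rather than the code.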

Getting the bugs to the right people has gotten much, much better with OpenJDK/Oracle. These days the Oracle QA team works closely with us to give us regular snapshot builds, and channels communication to make sure our issues don't get lost: it's much better! But I'm scarred from the past.

Once we get that far, even if someone is looking at it, that doesn't mean it will get fixed easily or at all. Look how long it took to make repros on https://issues.apache.org/jira/browse/LUCENE-5212, and then at the resulting debugging on https://bugs.openjdk.java.net/browse/JDK-8024830 (e.g. half a megabyte of assembler to look through). This one was additionally tricky because you needed a Sandy Bridge CPU to really trigger it.

Maybe IBM JVM bugs like https://groups.google.com/forum/#!msg/elasticsearch/ZWWxoj8HC7E/X7rWo3zS5ZkJ are equally complicated!

So in order to prevent problems, it's not that we can just give you a list of bugs, because we don't have one. It needs a whole process around a bunch of stuff:

- Who will take ownership, get things moving, set up Jenkins servers, etc.?
- Which versions should be tested in continuous builds? 7? 8?
- Which options should be tested? 32/64-bit? Which garbage collectors?
- Who will look at the test failures and sort through/debug them?
- How to track failures against existing bugs already being worked on or fixed?
- How to report new bugs that show up?
- How to get snapshot builds on a regular basis and find bugs before they make it into releases?

We just don't have anyone working on it for IBM. IBM got slowly worse and worse. At some point there were lots of parameters just disabling compilation of a bunch of methods that would otherwise get miscompiled. The large number of failures created a lot of noise and confusion. I remember screaming to just disable the IBM JDK from tests, which seems to have ultimately happened. Sometimes things would seem to get fixed (https://issues.apache.org/jira/browse/LUCENE-4987), but it's not as transparent as the Oracle case: we really don't know what is happening.

Like the rest of open source, you gotta scratch your own itch. If you really want the IBM JDK to work, then grab that IBM JDK and see if you can get the Lucene test suite passing several hundred or thousand times in a row. Triage the failures and work to get at least the critical ones fixed.

Even for the Oracle/OpenJDK case, it's a struggle. I generally won't spend my time on a bug unless I think it can cause data corruption. If it's just rare crashes, it's not worth it. We really want to be writing search engines and not QA testing the JVM!
