Elasticsearch 0.20 RC1 and IndexRequest.writeTo

Hi,

I came across an issue when doing some load testing (inserts only): my JVM actually crashed. Output:

JRE version: 7.0_07-b10
Java VM: Java HotSpot(TM) 64-Bit Server VM (23.3-b01 mixed mode linux-amd64 compressed oops)
Problematic frame:
J  org.elasticsearch.common.io.stream.HandlesStreamOutput.writeUTF(Ljava/lang/String;)V

I noticed that this method is now deprecated, and IndexRequest.writeTo no longer uses writeUTF in 0.20.
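
For context, here is roughly what that serialization change looks like; a minimal sketch against the 0.20-era StreamOutput API (the fields are illustrative, not the actual IndexRequest fields):

import java.io.IOException;
import org.elasticsearch.common.io.stream.StreamOutput;

// Sketch only: shows writeUTF's replacements, not the real IndexRequest code.
class MyRequest {
    String index;    // illustrative field, non-null
    String routing;  // illustrative field, may be null

    public void writeTo(StreamOutput out) throws IOException {
        // pre-0.20 (now deprecated): out.writeUTF(index);
        out.writeString(index);           // non-null string
        out.writeOptionalString(routing); // nullable: writes a presence flag first
    }
}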

I can upgrade my TransportClient to 0.20, but when I try to run it against a 0.19.8 ES server I get:

org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: transport content length received [1gb] exceeded [906.6mb]

(presumably the 0.19.8 server cannot parse the 0.20 wire format, so the frame length it reads is garbage)

When I upgrade my server to 0.20 as well, it works fine.

Is this expected not to work? If we upgrade our client to 0.20, is it not backward compatible with 0.19?

Thanks.

--

Both the Java client and the server need to be on the same major ES version. The failure you got is strange though; are you sure you don't have mix-ups in your classpath?
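
A quick way to check for classpath mix-ups is to ask the classloader which jars provide a known ES class; a minimal sketch (plain JDK calls, assuming org.elasticsearch.Version exists in your ES version):

import java.net.URL;
import java.util.Enumeration;

// Prints every classpath location providing the ES Version class;
// more than one line of output means duplicate/conflicting ES jars.
public class ClasspathCheck {
    public static void main(String[] args) throws Exception {
        Enumeration<URL> urls = ClasspathCheck.class.getClassLoader()
                .getResources("org/elasticsearch/Version.class");
        while (urls.hasMoreElements()) {
            System.out.println(urls.nextElement());
        }
    }
}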

On Oct 23, 2012, at 6:44 PM, emj11 <jagdhar@gmail.com> wrote:


--

Hi, I've upgraded both server and client to 0.20 RC1 and now I get a crash at:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fe6ad10f671, pid=31164, tid=140627433756416
JRE version: 7.0_07-b10
Java VM: Java HotSpot(TM) 64-Bit Server VM (23.3-b01 mixed mode linux-amd64 compressed oops)
Problematic frame:
J  org.elasticsearch.common.io.stream.StreamOutput.writeOptionalString(Ljava/lang/String;)V

Core dump written. Default location: /data/grid/logs/core or core.31164
If you would like to submit a bug report, please visit:
http://bugreport.sun.com/bugreport/crash.jsp

I do have a 12 GB core dump, but I'm having a hard time reading it through jstack or jmap. I plan to attach to it through gdb and will update with results later.
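
In case it's useful: JDK 7's jstack and jmap can read a core file directly when given the matching java binary (the paths below are assumptions for this setup):

# Use the same java binary that produced the core:
jstack /usr/java/jdk1.7.0_07/bin/java core.31164 > threads.txt
jmap -histo /usr/java/jdk1.7.0_07/bin/java core.31164 > histo.txt

# gdb only sees native frames, but it can confirm the faulting address:
gdb /usr/java/jdk1.7.0_07/bin/java core.31164 -ex bt -ex quit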

--

Hi,

can you please gist the whole hs_err_pid.log crash file somewhere, so it is
possible to review it and figure out whether it's likely a JVM or an ES
issue? The large core dump is not required. You will have little luck with
gdb; it's a Java JVM.

Thanks,

Jörg

On Thursday, October 25, 2012 1:04:14 AM UTC+2, emj11 wrote:

--

I have experienced large heap dumps when mixing different versions of
Elasticsearch.

--
Ivan

On Wed, Oct 24, 2012 at 4:36 PM, Jörg Prante <joergprante@gmail.com> wrote:


--

Here is the hs_err log:
ES-JVM crash · GitHub

I was going to use gdb to get a full stack trace...

--

Hi Ivan, appreciate you taking a look at this. I'm just learning the ropes
here with JVM crashes, so it's taking me a lot longer.

Thanks.

--

Hi,

at first glance, the JVM halts because of a problem with the optimizer;
computing the string argument seems to be a challenge. As first aid, please
remove the JVM option -XX:+AggressiveOpts.
Interestingly, it reminds me of JVM bugs that Lucene exposed some years
ago, e.g. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6942326
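
For example, assuming the options live in a JAVA_OPTS-style variable (the other flags shown are placeholders, not the actual settings from this setup):

# Hypothetical before/after; dropping -XX:+AggressiveOpts is the point here:
JAVA_OPTS="-Xms4g -Xmx4g -XX:+AggressiveOpts"   # before
JAVA_OPTS="-Xms4g -Xmx4g"                       # after: experimental opts disabled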

Cheers, Jörg

On Thursday, October 25, 2012 5:03:48 PM UTC+2, emj11 wrote:

--

Thanks Jörg. I was away on vacation; back now and looking at this again.

I have deduced that the issue only occurs with Coherence: I wrote a small
app that just does massive inserts through multiple transport clients, ran
it on the same machine, and had no issues.

I have a cache in Coherence which contains all of my objects; each put
into the cache triggers an async write-behind to a data store (in this case
ES). It's in this setup that I see the JVM crash, after about 9K
write-behinds across 3 transport clients.
In the Coherence classpath I did find an older version of Lucene which I
thought might have caused the issue, but I deleted those jars and still
see the JVM crash.
The crash happens almost every time when running with Coherence.

It may be another jar conflict lingering somewhere.
Any other suggestions are more than welcome.
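
For concreteness, the write-behind described above has roughly this shape; a hypothetical sketch (the class, index, and type names are made up, and it assumes Coherence's AbstractCacheStore base class and cached values that are already JSON strings):

import com.tangosol.net.cache.AbstractCacheStore;
import org.elasticsearch.client.Client;

// Hypothetical write-behind store: Coherence invokes store() asynchronously
// after a cache put, and we index the value into ES.
public class EsCacheStore extends AbstractCacheStore {

    private final Client client;

    public EsCacheStore(Client client) {
        this.client = client;
    }

    @Override
    public void store(Object key, Object value) {
        client.prepareIndex("myindex", "mytype", key.toString())
              .setSource((String) value) // assumes value is a JSON document
              .execute()
              .actionGet();
    }

    @Override
    public Object load(Object key) {
        return null; // write-only in this sketch; no read-through
    }
}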

--

Hi,

yes, I noticed in the crash dump report that Coherence is present, but there
shouldn't be such a serious JVM fault with either Coherence or ES or both
installed, so that's strange.

If you did additional OS tuning, it is hard to trace down whether the JVM
suffers from that.

One guess for the cause of the crash: the JVM (including NIO) memory has
been completely consumed on that machine, and the string cannot be
fetched/stored for parameter evaluation by the HotSpot compiler before an
OOM condition can be detected. That is a common challenge for memory
management systems under high load pressure. And you have AggressiveOpts
enabled, i.e. experimental JVM code is active. Without deeper knowledge of
the specific JVM implementation it is not possible to fix such issues.

A strategy for working around such a JVM issue, besides disabling
AggressiveOpts, could be to identify the programs that are allocating NIO
memory and try to reduce their consumption, thereby reducing memory
pressure and with it the chance of a JVM crash.
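
One way to watch NIO consumption from inside the JVM is the standard Java 7 buffer pool MXBeans; a small sketch (nothing ES- or Coherence-specific):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints the JDK-tracked NIO buffer pools ("direct" and "mapped").
public class NioMemoryProbe {
    public static void main(String[] args) {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(pool.getName()
                    + ": count=" + pool.getCount()
                    + ", used=" + pool.getMemoryUsed()
                    + " bytes, capacity=" + pool.getTotalCapacity() + " bytes");
        }
    }
}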

Just a side note: it should be sufficient to use the TransportClient as a
singleton. Is there any reason you run multiple transport clients? A
singleton consumes fewer sockets and less memory, including NIO memory.
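
For illustration, a minimal sketch of a process-wide client (0.20-era API; the cluster name and host are placeholders):

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// One shared TransportClient for the whole process; reuse it everywhere
// instead of creating one client per writer thread.
public final class EsClientHolder {
    private static final TransportClient CLIENT = create();

    private static TransportClient create() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster") // placeholder
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
    }

    public static Client get() {
        return CLIENT;
    }

    private EsClientHolder() {}
}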

Jörg

--

Hi Jörg, removing -XX:+AggressiveOpts did not help; however, after removing
all of the GC options I could not reproduce the crash. We have a lot of
options set, including G1GC, so there may be some sort of conflict. Our next
step is to take a closer look at these GC options.

A coworker of mine looked at the core file and noted that there was no
third-party native code loaded at the time of the crash; the loaded
libraries were all standard JDK libraries and Linux C runtimes / shared
objects. That made us think it was a memory management issue.

The theory is that heap memory may have been GC'ed instead of moved, along
with its references.

--

Meant to say that the byte array (used by writeUTF) on the heap may have
been removed by GC, rather than moved.

--

I agree, and I appreciate your feedback; nice to have it identified by
dropping the GC JVM options. Since the GC implementations in Java 7 are
quite new, and others could encounter the same issue in the future, you
might want to file a Java 7 bug report with Oracle at
http://bugreport.sun.com/bugreport/

Best regards,

Jörg

On Wednesday, November 21, 2012 11:02:58 PM UTC+1, emj11 wrote:

--

There were quite a few options being used. When I removed -XX:+UseG1GC, the
crash no longer happened.

Has anyone tested ES with Java 7's new G1 GC? I am using 0.19.8.

--

Just one more reason the bug should be reported!

The G1 GC is quite new, although it has been declared stable since Java
7u4. Yes, I'm interested in finding recommendable GC JVM settings for
Elasticsearch workloads on current JVMs, so this news is exciting to me.

Best regards,

Jörg

On Wednesday, November 28, 2012 9:30:26 PM UTC+1, emj11 wrote:

--