0.19.11 client node in Solaris WebLogic web app crashes JVM after creating hundreds of transport_client_worker threads

Hi,

  1. I am upgrading from 0.17.x to 0.19.11 and have deleted the old index files.
    The standalone JVM Elasticsearch server starts fine. It connects to a
    standalone JVM client node and indexes data without any problem.

  2. A 0.19.11 client node inside a Solaris WebLogic web app connecting to the
    above server node ends up creating hundreds of transport_client_worker
    threads and crashes WebLogic.

  3. I see hundreds of these lines in the crash log:
    "0x00000001040d1000 JavaThread "elasticsearch[Oneg the
    Prober][transport_server_worker][T#255]{New I/O worker #511}" daemon
    [_thread_in_native, id=797, stack(0xfffffffe9f400000,0xfffffffe9f500000)]"
    followed by hundreds of these:
    "0x0000000103def000 JavaThread "ExecuteThread: '126' for queue:
    'weblogic.socket.Muxer'" daemon [_thread_blocked, id=272,
    stack(0xfffffffee2400000,0xfffffffee2500000)]"

  4. My WebLogic log shows the following:
    [INFO ][10-Dec 10:00:10,717][][node][Oneg the Prober] {0.19.11}[13351]:
    starting ...
    [INFO ][10-Dec 10:00:12,984][][transport][Oneg the Prober] bound_address
    {inet[/169.49.110.160:9403]}, publish_address {inet[/169.49.110.160:9403]}

An unexpected error has been detected by Java Runtime Environment:

SIGBUS (0xa) at pc=0xffffffff7e2fe008, pid=13351, tid=560

Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b22 mixed mode solaris-sparc)

Problematic frame:
V [libjvm.so+0x6fe008]

An error report file with more information is saved as:
/app/securities/rapport/bin/hs_err_pid13351.log

If you would like to submit a bug report, please visit:
http://java.sun.com/webapps/bugreport/crash.jsp

Stack: [0xfffffffebce00000,0xfffffffebcf00000], sp=0xfffffffebcefc910, free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x6fe008]

Any help would be appreciated.

--

In point (1) of my original post above, I am also using "compress.lzf.decoder: safe" on the
standalone ES server.


--

Which Java do you run on the ES server, and which on the ES client side?

JVM 10.0-b22 is Java 1.6.0_06 (April 2008), more than four years old. Such
old Java versions have many bugs that will stop you from running ES
successfully.

Please update to the latest Java 7. Note that Oracle has scheduled Java 6
for end of life in February 2013.

Be aware that you cannot mix Java 6 and Java 7 in ES client/server
configurations due to JVM object serialization issues.

If you are bound to an obsolete Java version, you could try to tweak the
JVM parameters so they do not trigger subtle bugs, but YMMV. It will be
hard and frustrating.
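
As a quick sanity check, a small throwaway snippet like the one below (the
class name is just an example) can be run once inside the WebLogic app and
once on the standalone server to confirm which runtime each side actually uses:

// Minimal sketch: print the details of the JVM this code is running in.
public class WhichJvm {
    public static void main(String[] args) {
        System.out.println("java.version    = " + System.getProperty("java.version"));
        System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        System.out.println("java.vm.name    = " + System.getProperty("java.vm.name"));
        System.out.println("os.name/os.arch = " + System.getProperty("os.name")
                + "/" + System.getProperty("os.arch"));
    }
}

Inside WebLogic you would of course log these properties from the deployed
code instead of a main method.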

There are many transport client threads because the thread pools have been
enlarged in ES since 0.17. Hundreds of threads are far too many, of course.
I assume you either have many transport client instances open, or the
transport client has difficulty connecting and keeps starting threads while
retrying. But that is a secondary error; the primary error is that you can't
connect at all.

Best regards,

Jörg

--

Thanks Jörg,

I will try to get the JVM upgraded. Is it possible to limit the number of
threads the client node will create while trying to connect to the server? I
am assuming the thread pool settings
(http://www.elasticsearch.org/guide/reference/modules/threadpool.html) apply
only to server nodes and not to client nodes. Please tell me if I am wrong.


--

Is there a way to limit the number of threads used by a client node? I am
assuming the thread pool settings
(http://www.elasticsearch.org/guide/reference/modules/threadpool.html) apply
only to server nodes and not to client nodes.
Another observation is that the client node connects to the server and then
immediately fails. The ES server node logs show:

[Threnody] added {[The Stepford Cuckoos][-VIITrjqTPKRlNF9hz218Q][inet[/169.49.110.160:9402]]{client=true,
data=false},}, reason: zen-disco-receive(join from node[[The Stepford
Cuckoos][-VIITrjqTPKRlNF9hz218Q][inet[/169.49.110.160:9402]]{client=true,
data=false}])
[2012-12-12 07:12:13,727][INFO ][cluster.service ] [Threnody]
removed {[The Stepford Cuckoos][-VIITrjqTPKRlNF9hz218Q][inet[/169.49.110.160:9402]]{client=true,
data=false},}, reason: zen-disco-node_failed([The Stepford
Cuckoos][-VIITrjqTPKRlNF9hz218Q][inet[/169.49.110.160:9402]]{client=true,
data=false}), reason transport disconnected (with verified connect)

--

Hi,

Server and client nodes share the threadpool settings, for simplicity of
code design. A TransportClient does not use all thread pool types, only a
few.

You can limit the thread pools for a TransportClient that is only indexing
data like this. Assume you want ten Netty connections and ten threads for
index or bulk:

Settings settings = ImmutableSettings.settingsBuilder()
    .put("cluster.name", "mycluster")
    .put("client.transport.sniff", true)
    .put("transport.netty.connections_per_node.low", 0)
    .put("transport.netty.connections_per_node.med", 0)
    .put("transport.netty.connections_per_node.high", 10)
    .put("threadpool.search.type", "fixed")
    .put("threadpool.search.size", "1")
    .put("threadpool.get.type", "fixed")
    .put("threadpool.get.size", "1")
    .put("threadpool.index.type", "fixed")
    .put("threadpool.index.size", "10")
    .put("threadpool.bulk.type", "fixed")
    .put("threadpool.bulk.size", "10")
    .put("threadpool.refresh.type", "fixed")
    .put("threadpool.refresh.size", "1")
    .put("threadpool.percolate.type", "fixed")
    .put("threadpool.percolate.size", "1")
    .build();
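
For completeness, here is a sketch of how these settings could then be passed
to the TransportClient; "es-server-host" and port 9300 are placeholders for
your actual server address:

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Build the client with the settings from above and point it at the server.
TransportClient client = new TransportClient(settings);
client.addTransportAddress(new InetSocketTransportAddress("es-server-host", 9300));

// ... index or bulk as usual, then close the client on shutdown
// so that the Netty worker threads are released.
client.close();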

Immediate disconnects sometimes have obscure reasons. Please enable the
DEBUG level in the logging; this can reveal more information.

For example, it may happen because of different JVM versions between client
and server.

Jörg

--

Jörg,

The final solution to my problem was:

  1. Run both the client and server nodes on the same JVM version.
  2. Add the "compress.lzf.decoder: safe" setting to both client and server.

Without both of the above, the client continued to crash.

Without these two settings, the Solaris client node uses up and blocks all
available threads. Surely this must be a bug that needs to be addressed. I
will add the client thread settings you mentioned as a safety net.
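
For anyone else who hits this, here is a rough sketch of the client-node
setup, with the cluster name and thread pool values as illustrative
placeholders rather than my exact configuration:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// Client-node settings: the safe LZF decoder (matching the server) plus
// thread pool limits along the lines Jörg suggested.
Settings settings = ImmutableSettings.settingsBuilder()
    .put("cluster.name", "mycluster")        // placeholder cluster name
    .put("compress.lzf.decoder", "safe")     // must also be set on the server
    .put("threadpool.index.type", "fixed")
    .put("threadpool.index.size", "10")
    .build();

// Start a client-only node inside the web app and get a Client from it.
Node node = NodeBuilder.nodeBuilder()
    .client(true)
    .data(false)
    .settings(settings)
    .node();
Client client = node.client();
// ... use the client; call node.close() when the web app is undeployed.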

Thank you! Your suggestions were really useful!


--