Cluster state red, requests timed out, no error in log?


(maik2102) #1

Hi together,

I just tested with lots of indices, updates and so on.
Everything ran great. Then suddenly the ClusterState went to red,
connections on my application timed out.
I searched through the log and interpreted some hints, but there is no real
error, why some shards got unassigned and so turned the ClusterState to red.

Index creation: 2 Shards, 1 Repica each. 2 Servers, 45GB heap each.

{

"cluster_name" : "elasticsearch",

"status" : "red",

"timed_out" : false,

"number_of_nodes" : 2,

"number_of_data_nodes" : 2,

"active_primary_shards" : 1932,

"active_shards" : 3864,

"relocating_shards" : 0,

"initializing_shards" : 2,

"unassigned_shards" : 2

}

Even worse: Many many threads in my application stuck at one point:

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f2b03c60ae8> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:905)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1217)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:271)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
at sun.reflect.GeneratedMethodAccessor1210.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)

The mentioned threads all got triggered by this call, where indexName is a
local parameter: IndicesExistsResponse existsResponse =
client.admin().indices().prepareExists(indexName).execute().actionGet()

Should'nt this return immediately?

Some further information and the hints I gathered from the logs:

Log snipped, where last_flush_time is mentioned. Is it right, that in this
example it takes to long for flushing the indices?
[2012-08-17 10:57:54,852][TRACE][index.translog ] [es_server_1]
[index_A][1] flushing translog, last_flush_time [1345192072951], breached
[30m]
[2012-08-17 10:57:54,852][TRACE][index.translog ] [es_server_1]
[index_A][0] flushing translog, last_flush_time [1345192073057], breached
[30m]
[2012-08-17 10:57:54,852][TRACE][index.shard.service ] [es_server_1]
[index_A][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:54,852][TRACE][index.shard.service ] [es_server_1]
[index_A][0] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,845][TRACE][index.translog ] [es_server_1]
[index_B][0] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,845][TRACE][index.shard.service ] [es_server_1]
[index_B][0] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,845][TRACE][index.translog ] [es_server_1]
[index_C][1] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,845][TRACE][index.shard.service ] [es_server_1]
[index_C];de][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,847][TRACE][index.translog ] [es_server_1]
[index_D][1] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,847][TRACE][index.shard.service ] [es_server_1]
[index_D][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,849][TRACE][index.translog ] [es_server_1]
[index_D][0] flushing translog, last_flush_time [1345192076266], breached
[30m]

Timeout Exception in my application. I don't know what triggered it or what
was tried in background, because the lines are surrounded by others,
nothing to do with the search part.
2012-08-19 10:59:57: [org.elasticsearch.client.transport] - [Ent] failed to
get node info for [#transport#-2][inet[/192.168.1.123:9300]],
disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[/
192.168.1.123:9300]][cluster/nodes/info] request_id [24067] timed out after
[5000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

If you need further information, please let me know.
I appreciate any help I can get :slight_smile:

Thanks in advance
Greetings
Maik

--


(maik2102) #2

Just found more lines, which I think are not normal. Lots of lines like
this:
[2012-08-17 10:58:31,602][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x1d66a350, /application_server1:34987 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,608][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x3d6be844, /application_server1:34988 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,608][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x362159bd, /application_server1:34989 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,609][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x7991bd49, /application_server1:34990 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,609][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x5d277762, /application_server1:34991 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x4f2241f9, /application_server1:34992 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x1b606f0f, /application_server1:34993 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x6915d2d5, /application_server1:34994 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,611][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x7b3057bf, /application_server1:34995 :>
/192.168.1.123:9300]
[2012-08-17 10:58:32,450][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x081f0f92, /application_server2:49168 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x53e2360f, /application_server2:49169 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x284b3dcf, /application_server2:49170 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x1b2cedf2, /application_server2:49171 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x4f13ee31, /application_server2:49172 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x2d172da3, /application_server2:49174 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,452][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x0f5941ab, /application_server2:49173 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,452][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x42de687b, /application_server2:49175 =>
/192.168.1.123:9300]

--


(system) #3