Cluster state red, requests timed out, no error in log?

maik2102 · August 20, 2012, 1:40pm

Hi together,

I just tested with lots of indices, updates and so on.
Everything ran great. Then suddenly the ClusterState went to red,
connections on my application timed out.
I searched through the log and interpreted some hints, but there is no real
error, why some shards got unassigned and so turned the ClusterState to red.

Index creation: 2 Shards, 1 Repica each. 2 Servers, 45GB heap each.

{

"cluster_name" : "elasticsearch",

"status" : "red",

"timed_out" : false,

"number_of_nodes" : 2,

"number_of_data_nodes" : 2,

"active_primary_shards" : 1932,

"active_shards" : 3864,

"relocating_shards" : 0,

"initializing_shards" : 2,

"unassigned_shards" : 2

}

Even worse: Many many threads in my application stuck at one point:

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f2b03c60ae8> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:905)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1217)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:271)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
at sun.reflect.GeneratedMethodAccessor1210.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)

The mentioned threads all got triggered by this call, where indexName is a
local parameter: IndicesExistsResponse existsResponse =
client.admin().indices().prepareExists(indexName).execute().actionGet()

Should'nt this return immediately?

Some further information and the hints I gathered from the logs:

Log snipped, where last_flush_time is mentioned. Is it right, that in this
example it takes to long for flushing the indices?
[2012-08-17 10:57:54,852][TRACE][index.translog ] [es_server_1]
[index_A][1] flushing translog, last_flush_time [1345192072951], breached
[30m]
[2012-08-17 10:57:54,852][TRACE][index.translog ] [es_server_1]
[index_A][0] flushing translog, last_flush_time [1345192073057], breached
[30m]
[2012-08-17 10:57:54,852][TRACE][index.shard.service ] [es_server_1]
[index_A][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:54,852][TRACE][index.shard.service ] [es_server_1]
[index_A][0] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,845][TRACE][index.translog ] [es_server_1]
[index_B][0] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,845][TRACE][index.shard.service ] [es_server_1]
[index_B][0] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,845][TRACE][index.translog ] [es_server_1]
[index_C][1] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,845][TRACE][index.shard.service ] [es_server_1]
[index_C];de][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,847][TRACE][index.translog ] [es_server_1]
[index_D][1] flushing translog, last_flush_time [1345192076266], breached
[30m]
[2012-08-17 10:57:59,847][TRACE][index.shard.service ] [es_server_1]
[index_D][1] flush with full[false], refresh[false], force[false]
[2012-08-17 10:57:59,849][TRACE][index.translog ] [es_server_1]
[index_D][0] flushing translog, last_flush_time [1345192076266], breached
[30m]

Timeout Exception in my application. I don't know what triggered it or what
was tried in background, because the lines are surrounded by others,
nothing to do with the search part.
2012-08-19 10:59:57: [org.elasticsearch.client.transport] - [Ent] failed to
get node info for [#transport#-2][inet[/192.168.1.123:9300]],
disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[/
192.168.1.123:9300]][cluster/nodes/info] request_id [24067] timed out after
[5000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

If you need further information, please let me know.
I appreciate any help I can get

Thanks in advance
Greetings
Maik

--

maik2102 · August 20, 2012, 1:47pm

Just found more lines, which I think are not normal. Lots of lines like
this:
[2012-08-17 10:58:31,602][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x1d66a350, /application_server1:34987 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,608][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x3d6be844, /application_server1:34988 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,608][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x362159bd, /application_server1:34989 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,609][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x7991bd49, /application_server1:34990 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,609][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x5d277762, /application_server1:34991 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x4f2241f9, /application_server1:34992 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x1b606f0f, /application_server1:34993 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,610][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x6915d2d5, /application_server1:34994 :>
/192.168.1.123:9300]
[2012-08-17 10:58:31,611][TRACE][transport.netty ] [es_server_2]
channel closed: [id: 0x7b3057bf, /application_server1:34995 :>
/192.168.1.123:9300]
[2012-08-17 10:58:32,450][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x081f0f92, /application_server2:49168 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x53e2360f, /application_server2:49169 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x284b3dcf, /application_server2:49170 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x1b2cedf2, /application_server2:49171 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x4f13ee31, /application_server2:49172 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,451][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x2d172da3, /application_server2:49174 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,452][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x0f5941ab, /application_server2:49173 =>
/192.168.1.123:9300]
[2012-08-17 10:58:32,452][TRACE][transport.netty ] [es_server_2]
channel opened: [id: 0x42de687b, /application_server2:49175 =>
/192.168.1.123:9300]

--

Topic		Replies	Views
Inserting new index fails, cluster goes Red Elasticsearch	1	388	November 2, 2018
Cluster red, unassigned shards, no response on writes Elasticsearch	17	691	February 22, 2023
Cluster turns to red after reboot Elasticsearch	29	2768	January 4, 2019
Cluster Red - unallocated shards in a index Elasticsearch	3	1848	January 30, 2018
Cluster health red Elasticsearch	4	445	July 6, 2017

Cluster state red, requests timed out, no error in log?

Related topics