Red status after unexpected stop


(Arnau Arnauito) #1

Hello all,

I did ask similar question to logstash mailing-list, but as I've debugging
the problem and it's 100% EL related I've decided to ask for help here.

I have a 1 node EL cluster filled from logstash. Everything was working
fine but ast week the node suffered an unexpected powercut and now the
cluster is in red status:

{
cluster_name: BigLog
status: red
timed_out: false
number_of_nodes: 1
number_of_data_nodes: 1
active_primary_shards: 1175
active_shards: 1175
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 1195
}

The first start after the crash gave me no error:

[2014-05-06 11:44:05,867][INFO ][node ] [BigLog1]
version[0.90.10], pid[32758], build[0a5781f/2014-01-10T10:18:37Z]
[2014-05-06 11:44:05,868][INFO ][node ] [BigLog1]
initializing ...
[2014-05-06 11:44:05,875][INFO ][plugins ] [BigLog1]
loaded [], sites []
[2014-05-06 11:49:30,318][INFO ][node ] [BigLog1]
version[0.90.10], pid[1212], build[0a5781f/2014-01-10T10:18:37Z]
[2014-05-06 11:49:30,318][INFO ][node ] [BigLog1]
initializing ...
[2014-05-06 11:49:30,325][INFO ][plugins ] [BigLog1]
loaded [], sites []
[2014-05-06 11:50:16,450][INFO ][node ] [BigLog1]
initialized
[2014-05-06 11:50:16,450][INFO ][node ] [BigLog1]
starting ...
[2014-05-06 11:50:16,611][INFO ][transport ] [BigLog1]
bound_address {inet[/172.17.20.4:9301]}, publish_address
{inet[/172.17.20.4:9301]}
[2014-05-06 11:50:19,652][INFO ][cluster.service ] [BigLog1]
new_master [BigLog1][TVlZ5MjhRTm21c6iu57NbA][inet[/172.17.20.4:9301]],
reason: zen-disco-join (elected_as
_master)
[2014-05-06 11:50:19,676][INFO ][discovery ] [BigLog1]
BigLog/TVlZ5MjhRTm21c6iu57NbA
[2014-05-06 11:50:20,095][INFO ][http ] [BigLog1]
bound_address {inet[/172.17.20.4:9200]}, publish_address
{inet[/172.17.20.4:9200]}
[2014-05-06 11:50:20,096][INFO ][node ] [BigLog1]
started
[2014-05-06 11:50:23,024][INFO ][gateway ] [BigLog1]
recovered [235] indices into cluster_state
[2014-05-06 11:50:23,024][INFO ][cluster.service ] [BigLog1] added
{[Stentor][fqar3OoSSSOPMgPHvNbV3w][inet[/172.17.20.4:9300]]{client=true,
data=false},}, reason: zen
-disco-receive(join from
node[[Stentor][fqar3OoSSSOPMgPHvNbV3w][inet[/172.17.20.4:9300]]{client=true,
data=false}])
[2014-05-06 11:50:23,150][WARN ][transport.netty ] [BigLog1]
Message not fully read (request) for [30] and action
[cluster/nodeIndexCreated], resetting
[...]
[2014-05-06 11:50:49,899][DEBUG][action.search.type ] [BigLog1] All
shards failed for phase: [query]
[2014-05-06 11:50:49,901][DEBUG][action.search.type ] [BigLog1] All
shards failed for phase: [query]
[2014-05-06 11:52:14,027][INFO ][cluster.service ] [BigLog1]
removed
{[Stentor][fqar3OoSSSOPMgPHvNbV3w][inet[/172.17.20.4:9300]]{client=true,
data=false},}, reason:
zen-disco-node_failed([Stentor][fqar3OoSSSOPMgPHvNbV3w][inet[/172.17.20.4:9300]]{client=true,
data=false}), reason transport disconnected (with verified connect)

After that start, a new restart said absolutly nothing strange:

[2014-05-06 14:28:19,151][INFO ][node ] [BigLog1]
version[0.90.10], pid[9441], build[0a5781f/2014-01-10T10:18:37Z]
[2014-05-06 14:28:19,152][INFO ][node ] [BigLog1]
initializing ...
[2014-05-06 14:28:19,158][INFO ][plugins ] [BigLog1]
loaded [], sites []
[2014-05-06 14:28:36,283][INFO ][node ] [BigLog1]
initialized
[2014-05-06 14:28:36,283][INFO ][node ] [BigLog1]
starting ...
[2014-05-06 14:28:36,440][INFO ][transport ] [BigLog1]
bound_address {inet[/172.17.20.4:9300]}, publish_address
{inet[/172.17.20.4:9300]}
[2014-05-06 14:28:39,479][INFO ][cluster.service ] [BigLog1]
new_master [BigLog1][XXeD06hgQZKks-ChqVh-5w][inet[/172.17.20.4:9300]],
reason: zen-disco-join (elected_as_master)
[2014-05-06 14:28:39,517][INFO ][discovery ] [BigLog1]
BigLog/XXeD06hgQZKks-ChqVh-5w
[2014-05-06 14:28:39,760][INFO ][http ] [BigLog1]
bound_address {inet[/172.17.20.4:9200]}, publish_address
{inet[/172.17.20.4:9200]}
[2014-05-06 14:28:39,760][INFO ][node ] [BigLog1]
started
[2014-05-06 14:28:41,788][INFO ][gateway ] [BigLog1]
recovered [237] indices into cluster_state

But the status is red :frowning:

So, I've enabled debug and I've tried to understand what happening with
shards, so I've looked for one into log:
[2014-05-07 11:38:52,188][DEBUG][gateway.local ] [BigLog1]
[logstash-2014.02.17][0]: throttling allocation [[logstash-2014.02.17][0],
node[null], [P], s[UNASSIGNED]] to
[[[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]]]] on primary
allocation
[...]
[2014-05-07 11:39:15,362][DEBUG][gateway.local ] [BigLog1]
[logstash-2014.02.17][0]: throttling allocation [[logstash-2014.02.17][0],
node[null], [P], s[UNASSIGNED]] to
[[[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]]]] on primary
allocation
[2014-05-07 11:39:15,551][DEBUG][gateway.local ] [BigLog1]
[logstash-2014.02.17][0]: allocating [[logstash-2014.02.17][0], node[null],
[P], s[UNASSIGNED]] to
[[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]]] on primary
allocation
[2014-05-07 11:39:15,638][DEBUG][indices.cluster ] [BigLog1]
[logstash-2014.02.17][0] creating shard
[2014-05-07 11:39:15,725][DEBUG][index.deletionpolicy ] [BigLog1]
[logstash-2014.02.17][0] Using [keep_only_last] deletion policy
[2014-05-07 11:39:15,726][DEBUG][index.merge.policy ] [BigLog1]
[logstash-2014.02.17][0] using [tiered] merge policy with
expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10],
max_merge_at_once_explicit[30], max_merged_segment[5gb],
segments_per_tier[10.0], reclaim_deletes_weight[2.0], async_merge[true]
[2014-05-07 11:39:15,726][DEBUG][index.merge.scheduler ] [BigLog1]
[logstash-2014.02.17][0] using [concurrent] merge scheduler with
max_thread_count[3]
[2014-05-07 11:39:15,726][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] state: [CREATED]
[2014-05-07 11:39:15,727][DEBUG][index.translog ] [BigLog1]
[logstash-2014.02.17][0] interval [5s], flush_threshold_ops [5000],
flush_threshold_size [200mb], flush_threshold_period [30m]
[2014-05-07 11:39:15,727][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] state: [CREATED]->[RECOVERING], reason [from
gateway]
[2014-05-07 11:39:15,727][DEBUG][index.gateway ] [BigLog1]
[logstash-2014.02.17][0] starting recovery from local ...
[2014-05-07 11:39:15,732][DEBUG][index.engine.robin ] [BigLog1]
[logstash-2014.02.17][0] starting engine
[2014-05-07 11:39:15,768][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] state: [RECOVERING]->[POST_RECOVERY], reason [post
recovery]
[2014-05-07 11:39:15,768][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] scheduling refresher every 5s
[2014-05-07 11:39:15,769][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] scheduling optimizer / merger every 1s
[2014-05-07 11:39:15,769][DEBUG][index.gateway ] [BigLog1]
[logstash-2014.02.17][0] recovery completed from [local], took [42ms]
[2014-05-07 11:39:15,769][DEBUG][cluster.action.shard ] [BigLog1]
sending shard started for [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [after recovery from gateway]
[2014-05-07 11:39:15,769][DEBUG][cluster.action.shard ] [BigLog1]
received shard started for [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [after recovery from gateway]
[2014-05-07 11:39:15,953][DEBUG][cluster.action.shard ] [BigLog1]
sending shard started for [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]
[2014-05-07 11:39:15,953][DEBUG][cluster.action.shard ] [BigLog1]
received shard started for [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]
[2014-05-07 11:39:15,966][DEBUG][cluster.action.shard ] [BigLog1]
[logstash-2014.02.17][0] will apply shard started [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [after recovery from gateway]
[2014-05-07 11:39:15,966][DEBUG][cluster.action.shard ] [BigLog1]
[logstash-2014.02.17][0] will apply shard started [logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING], indexUUID
[pHG4k-OJQFeCiCfrhf59Iw], reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]
[2014-05-07 11:39:16,210][DEBUG][index.shard.service ] [BigLog1]
[logstash-2014.02.17][0] state: [POST_RECOVERY]->[STARTED], reason [global
state is [STARTED]]
[2014-05-07 11:39:18,477][DEBUG][index.engine.robin ] [BigLog1]
[logstash-2014.02.17][0] updating index_buffer_size from [64mb] to [4mb]
[2014-05-07 11:40:17,497][DEBUG][cluster.service ] [BigLog1]
processing [shard-started ([logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING]), reason [after recovery
from gateway]]: execute
[2014-05-07 11:40:17,526][DEBUG][cluster.service ] [BigLog1]
cluster state updated, version [329], source [shard-started
([logstash-2014.02.17][0], node[r4MP-9AYTrKkzva1IoXzDw], [P],
s[INITIALIZING]), reason [after recovery from gateway]]
[2014-05-07 11:40:17,740][DEBUG][cluster.service ] [BigLog1]
processing [shard-started ([logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING]), reason [after recovery
from gateway]]: done applying updated cluster_state (version: 329)
[2014-05-07 11:40:18,302][DEBUG][cluster.service ] [BigLog1]
processing [shard-started ([logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING]), reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]]:
execute
[2014-05-07 11:40:18,329][DEBUG][cluster.service ] [BigLog1]
cluster state updated, version [332], source [shard-started
([logstash-2014.02.17][0], node[r4MP-9AYTrKkzva1IoXzDw], [P],
s[INITIALIZING]), reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]]
[2014-05-07 11:40:18,504][DEBUG][cluster.service ] [BigLog1]
processing [shard-started ([logstash-2014.02.17][0],
node[r4MP-9AYTrKkzva1IoXzDw], [P], s[INITIALIZING]), reason [master
[BigLog1][r4MP-9AYTrKkzva1IoXzDw][inet[/172.17.20.4:9300]] marked shard as
initializing, but shard state is [POST_RECOVERY], mark shard as started]]:
done applying updated cluster_state (version: 332)
[2014-05-07 12:09:48,982][DEBUG][indices.memory ] [BigLog1]
marking shard [logstash-2014.02.17][0] as inactive (inactive_time[30m])
indexing wise, setting size to [500kb]
[2014-05-07 12:10:13,726][DEBUG][index.engine.robin ] [BigLog1]
[logstash-2014.02.17][0] updating index_buffer_size from [4mb] to
(inactive) [500kb]

Could someone help me to understand what happening with this (an other)
shard?

TIA,
Arnau

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cae1a071-801f-4f0b-aecd-7e2780180a5b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Arnau Arnauito) #2

Hello,

just an update as I was able to solve the issue.
The health was:

cluster_name: BigLog
status: red
timed_out: false
number_of_nodes: 1
number_of_data_nodes: 1
active_primary_shards: 1175
active_shards: 1175
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 1195

And the amount of unassigned shards came from the number of replicas:
1 and number of nodes 1 . So , a copy of each shard was expecting a new
node for being replicated.

For solving that, I did remove the repica of each index i.e :

curl -XPUT "http://XXXXXXX:9200/logstash-2014.05.01/_settings?" -d '{
"index" : {
"number_of_replicas" : 0 } }
'

Then, I saw that I still had 280 unassigned shards. I got their indexes
by doing:

curl XXXXXX:9200/_cluster/state

and then looking for routing_nodes.

As log did not say anything about them, I decided to remove those
indexes:

curl -XDELETE "http://XXXXXX:9200/$a/"

and the refresh:

curl -XPOST "http://XXXXX:9200/$a/_refresh"

now my cluster is in green (as I've removed the replicas) and logstash
is working .

{
cluster_name: BigLog
status: green
timed_out: false
number_of_nodes: 1
number_of_data_nodes: 1
active_primary_shards: 1200
active_shards: 1200
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Cheers,
Arnau

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/20140508151154.6fb5d82a%40eidolon.
For more options, visit https://groups.google.com/d/optout.


(system) #3