ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in initializing state

oh awesome! thanks for the feedback!

simon

On Thursday, April 11, 2013 8:00:56 AM UTC+2, Dong Aihua wrote:

Hi, Simon:
Today I upgraded the ES cluster from 0.19.11 to 0.20.6 successfully.
The upgrade was smooth, and the problem has not appeared again.
Thank you.

On Monday, April 1, 2013 11:21:49 AM UTC+8, Dong Aihua wrote:

Hi, Simon:
I tested the latest trunk code, and it has no problem either. The logs
are located at git@github.com:dongaihua/shares.git
By the way, we did hit the TooManyOpenFiles problem once, but after we
raised the limit with ulimit -n, we have not seen it again.
After you check the logs, can you confirm whether the problem is really
resolved?
Thank you.
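
For anyone who wants to check the same limit from a script, here is a minimal sketch that reads the open-file limit that ulimit -n reports. It is not from the thread: the helper names and the 32000 threshold are assumptions chosen only for illustration, and the resource module is available on Unix-like systems only.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_low(soft, threshold=32000):
    """True when the soft limit is finite and below the threshold.

    The threshold is an illustrative assumption, not a value from the thread.
    """
    return soft != resource.RLIM_INFINITY and soft < threshold


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={soft} hard={hard}")
    if is_limit_low(soft):
        print("soft limit may be too low for a node holding many shards")
```

The check has to run as the same user and in the same environment that starts the node, since the limit is per process.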

On Thursday, March 28, 2013 4:36:00 PM UTC+8, Dong Aihua wrote:

OK, I will test it and send you the results and the logs.

On Thursday, March 28, 2013 4:07:17 PM UTC+8, simonw wrote:

Hmm, that is interesting... I suspect that the fix for the missed
ChannelClosed events fixed it then. I don't have your logs around
anymore, but did you see any TooManyOpenFiles exceptions by any chance? Can
you confirm that the latest master has fixed this as well by running a
master version / latest build?

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:

Hi, Simon:
Today I tested the latest ES version, 0.20.6, and the problem seems to
be fixed in this version.
Can you confirm it? I'm also wondering how you fixed it.
Thanks a lot.

-Regards-
-dongaihua-

On Monday, March 11, 2013 6:10:08 PM UTC+8, Dong Aihua wrote:

Hi, Simon:
Today I tested 0.19.12, and it is fine.
Then I tested 0.20.0, and the problem happened again. So I guess the
problem was introduced in 0.20.0.
The test steps are the same as before. The following are some logs:
{
  "cluster_name" : "test-0.20.0",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 224,
  "active_shards" : 445,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node:
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards:
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
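
To see at a glance which nodes hold the stuck shards, the initShards listing above can be grouped by node. This is only a sketch: the entries are transcribed by hand from the listing, and the function names are made up for illustration. The node field is itself a JSON string, so it is decoded before counting.

```python
import json
from collections import Counter

# Entries transcribed from the initShards listing above.
INIT_SHARDS = [
    {"node": '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
     "primary": False, "shard": 5, "index": "testx-xxx-2012-03-11"},
    {"node": '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
     "primary": False, "shard": 11, "index": "test-2013-03-11"},
    {"node": '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
     "primary": True, "shard": 7, "index": "testx-xxx-2013-zzz"},
    {"node": '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
     "primary": False, "shard": 5, "index": "testx-xxx-2013-yyy"},
]


def stuck_shards_per_node(init_shards):
    """Count initializing shards per node name."""
    counts = Counter()
    for entry in init_shards:
        node = json.loads(entry["node"])  # the node field is a JSON string
        counts[node["name"]] += 1
    return dict(counts)


def initializing_primaries(init_shards):
    """Return (index, shard) pairs for primaries still initializing."""
    return [(e["index"], e["shard"]) for e in init_shards if e["primary"]]


print(stuck_shards_per_node(INIT_SHARDS))
# {'xseed038.kdev': 2, 'xseed034.kdev': 2}
print(initializing_primaries(INIT_SHARDS))
# [('testx-xxx-2013-zzz', 7)]
```

In this sample all four initializing shards sit on two nodes, and the only initializing primary belongs to the same index that is listed under unassigned_indices.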
The detailed logs are in test-0.20.0_LOGS.tar.gz at
https://github.com/dongaihua/shares/blob/master/test-0.20.0_LOGS.tar.gz
By the way, the servers were set up by the IT team, so I'm not clear
about the physical connections. If you really need that information, I can
ask and get back to you.
Thank you.

-Regards-
-dongaihua-

On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:

Thanks for the heads-up, that is what I figured. I am just wondering
what triggered the intermittent disconnects that cause the servers to wait
on the peers they recover from. Can you tell me a little about your setup?
Are they VMs? Do you have firewalls in place? Are the servers in different
datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:

By the way, the server disconnects at the end of the logs are there
because I ran a shutdown command after the test finished.

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:

So, I looked closer at the latest logs and I see a lot of
disconnects going on, which gives me the impression that you have some
network issues. Nevertheless, we pushed some changes to detect these
situations earlier, but none of us was able to reproduce your issue. The
only thing I can ask you is to try again with the latest master to see if
those commits helped in any way.
What is your setup, by the way? Any idea why the servers disconnect all
the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:

Hi,
Is there any clue about the root cause of this problem?
Thank you.

-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 2:42:55 PM UTC+8, Dong Aihua wrote:

Hi, Simon
Please get the logs, es-test2-0-90-0-beta1_LOGS.tar.gz, from
https://github.com/dongaihua/shares
(direct link: https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz)
Thank you.

-Regards-
-Dong aihua-

On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:

I tried to upload the logs several times, but every attempt failed
with a 340 error. I will upload them later.

On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. The following are my
test steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.

The detailed logs from the 20 nodes are attached.
Thank you.
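
The manual check used in the steps above (create an index, then look at the health counters and wait) can be sketched as a polling loop. This is an illustration, not what was run in the test: the function names, timeout, and interval are assumptions, and only the /_cluster/health endpoint and its counter names come from the thread.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET /_cluster/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def wait_for_recovery(fetch, timeout_s=300.0, interval_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until no shard is initializing or unassigned.

    Returns (settled, last_health). settled is False when the timeout
    expires first, which is the "stayed in this state" case.
    """
    deadline = clock() + timeout_s
    while True:
        health = fetch()
        pending = health["initializing_shards"] + health["unassigned_shards"]
        if pending == 0:
            return True, health
        if clock() >= deadline:
            return False, health
        sleep(interval_s)


# Example call against the HTTP address used in the steps above:
# settled, health = wait_for_recovery(
#     lambda: fetch_health("http://10.96.250.214:10200"))
```

Passing the fetch function in as a parameter keeps the loop testable without a live cluster.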

-Regards-
-Dong Aihua-

On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:

Hey,

Thanks for testing this. I can see some exceptions during
recovery when starting a replica, which cause one of the machines to wait
for an answer that never comes back, and it doesn't seem to get notified
that the connection is closed. Can you try setting
"indices.recovery.internal_action_timeout: 30s"
so we can see if this happens just because of the closed
connections here?

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I already opened an issue:
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest ES version, 0.90.0.Beta1, and it
has the same problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212, and 213 are master nodes, and 211 is
the leader; 214 and 215 are load balancers; the other 15 nodes are data nodes.
My test steps are as follows:

  1. After the cluster was up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
  3. Then I created one more empty index:
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.

You can find the detailed logs from the 20 nodes in the attachments.
Thank you.
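
As a cross-check on the numbers above, the shard counters can be added up after each step. In all three responses active, initializing, and unassigned shards sum to a multiple of 30, which would fit 15 primaries plus one replica each per index. That is an inference from the counters, since the index settings are not shown in the thread. The sketch below also assumes the cluster holds only the indices created in these steps.

```python
# Counters transcribed from the three health responses above.
HEALTH_AFTER_STEP = [
    {"active_primary_shards": 15, "active_shards": 27,
     "initializing_shards": 3, "unassigned_shards": 0},
    {"active_primary_shards": 29, "active_shards": 53,
     "initializing_shards": 6, "unassigned_shards": 1},
    {"active_primary_shards": 43, "active_shards": 78,
     "initializing_shards": 6, "unassigned_shards": 6},
]

# Assumption inferred from step 1, where one index gives 15 active primaries.
PRIMARIES_PER_INDEX = 15


def total_copies(health):
    """All shard copies the cluster reports, whatever their state."""
    return (health["active_shards"]
            + health["initializing_shards"]
            + health["unassigned_shards"])


def missing_primaries(health, index_count):
    """Primaries that are not active, assuming PRIMARIES_PER_INDEX per index."""
    return index_count * PRIMARIES_PER_INDEX - health["active_primary_shards"]


for step, health in enumerate(HEALTH_AFTER_STEP, start=1):
    print(step, total_copies(health), missing_primaries(health, step))
# 1 30 0
# 2 60 1
# 3 90 2
```

Under these assumptions one primary is not active after step 2 and two after step 3, which matches the status changing from yellow to red.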

-Regards-
-Dong Aihua-

On Wednesday, February 27, 2013 6:51:49 PM UTC+8, Clinton Gormley wrote:

On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

Hi, guys:
Through recent testing, I found bugs in ES 0.20.4 and 0.20.5 that
cause shard allocation to fail and leave shards stuck in the
initializing state.
The following are my test steps:

  1. I set up 20 nodes with 0.20.4 and brought up a fresh cluster:
    3 master nodes, 2 load balancer nodes, and 15 data nodes.
  2. After the cluster was up, I tried to create some empty indices,
    for example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc.
    But some shards stayed stuck in the initializing state for a long
    time.

Please can you open an issue on GitHub, with a full recreation of what
you need to do to reproduce this problem, plus all the logs from all of
the nodes.

ta

Clint
