ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in initializing state

Hi, guys:
While testing over the past few days, I found what look like ES bugs in
0.20.4 and 0.20.5 that cause shard allocation to fail and shards to get
stuck in the initializing state.
These are my test steps:

  1. I set up 20 nodes with 0.20.4 and brought a fresh cluster up. 3 nodes
    are master nodes, 2 nodes are load balancers, and 15 nodes are data nodes.
  2. After the cluster was up, I tried to create some empty indices, for
    example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc.
    But some shards stayed stuck in the initializing state for a long time.
    {
    "cluster_name" : "es-test",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 105,
    "active_shards" : 201,
    "relocating_shards" : 0,
    "initializing_shards" : 9,
    "unassigned_shards" : 0
    }
    Moreover, when I created index-2013-03-02, the cluster turned red.
    {
    "cluster_name" : "es-test",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 119,
    "active_shards" : 228,
    "relocating_shards" : 0,
    "initializing_shards" : 11,
    "unassigned_shards" : 1
    }
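
For anyone reproducing the topology from step 1, here is a minimal sketch of the per-role node settings. It assumes the three roles were set up with the `node.master` and `node.data` flags in each node's elasticsearch.yml; the actual config files are not shown in this thread, and the names `ROLE_SETTINGS` and `render_role` are made up for illustration.

```python
# Hypothetical mapping of the roles from step 1 to node.master / node.data.
# The real configuration used in this test is not shown in the thread, so
# treat the values below as an assumption.
ROLE_SETTINGS = {
    "master": {"node.master": True, "node.data": False},
    "load_balancer": {"node.master": False, "node.data": False},
    "data": {"node.master": False, "node.data": True},
}


def render_role(role):
    """Return the elasticsearch.yml lines for one of the roles above."""
    settings = ROLE_SETTINGS[role]
    # Sort keys so the output is stable, and lower-case the booleans
    # because YAML expects true/false rather than Python's True/False.
    return "\n".join(
        f"{key}: {str(value).lower()}" for key, value in sorted(settings.items())
    )
```

Calling `render_role("load_balancer")` yields the two lines that make a node neither master-eligible nor a data holder, which is how the 2 load balancer nodes would be configured under this assumption.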

I set the log level to trace and checked the logs, but no errors are
shown. The logs do show, however, that some shards are initializing and
unassigned.
I also tried 0.20.5, and the same problem happened.
With 0.19.11, the problem disappeared: all the empty indices were
created instantly, even with some strange index names.
So I guess ES has some bugs in 0.20.4 and 0.20.5.
Could Kimchy or any other ES experts take a look at this problem?
Thank you very much!

-Regards-
-Dong Aihua-

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

By the way, I also tested a single node with 0.20.4 and 0.20.5, and the
problem doesn't happen there.

On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

Hi, guys:
While testing over the past few days, I found what look like ES bugs in
0.20.4 and 0.20.5 that cause shard allocation to fail and shards to get
stuck in the initializing state.
These are my test steps:

  1. I set up 20 nodes with 0.20.4 and brought a fresh cluster up. 3
    nodes are master nodes, 2 nodes are load balancers, and 15 nodes are
    data nodes.
  2. After the cluster was up, I tried to create some empty indices, for
    example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc.
    But some shards stayed stuck in the initializing state for a long time.

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint

Hey,

thanks for testing this. I can see an exception during recovery when
starting a replica, which causes one of the machines to wait for an answer
that never comes back, and it doesn't seem to get notified that the
connection is closed. Can you try to set:
"indices.recovery.internal_action_timeout: 30s"
So we can see if this happens just because of the closed connections here?
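
With 20 nodes, it may be easier to apply the setting with a small script than by hand. The sketch below is only an illustration: it assumes the setting is read from each node's elasticsearch.yml at startup and that the file uses flat `key: value` lines rather than nested YAML. The function name `with_setting` is made up.

```python
def with_setting(config_text, key, value):
    """Return config_text with `key: value` set exactly once.

    An existing uncommented line for the same key is replaced; otherwise
    the setting is appended. Only flat `key: value` lines are understood.
    """
    new_line = f"{key}: {value}"
    out = []
    replaced = False
    for line in config_text.splitlines():
        # Compare the text before the first colon with the key; a
        # commented-out line ("# key: ...") does not match and is kept.
        if line.split(":", 1)[0].strip() == key:
            if not replaced:
                out.append(new_line)
                replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Example: add the timeout suggested above to a config file's contents.
example = with_setting(
    "cluster.name: es-test\n",
    "indices.recovery.internal_action_timeout",
    "30s",
)
```

Running the helper twice on the same text leaves it unchanged, so it is safe to re-apply across all nodes before restarting them.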

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I already opened an issue:
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest ES version, 0.90.0.Beta1, and it has the same
problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212 and 213 are master nodes, with 211 as the
leader; 214 and 215 are load balancers; the other 15 nodes are data nodes.
My test steps are as follows:

  1. After the cluster was up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
  3. Then I created one more empty index: 1234abcd
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.
    You can refer to the detailed logs from the 20 nodes in the attachments.
    Thank you.
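
The steps above can be scripted so that "stayed in this state for a long time" becomes a measured timeout. This is a rough sketch, not the script used for this report: it polls `_cluster/health` (the same endpoint as the curl calls) until the status is green or a deadline passes. The timeout values and function names are placeholders.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET /_cluster/health from one node and decode the JSON body."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_green(fetch, timeout_s=300.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until the status is green or timeout_s has elapsed.

    Returns (reached_green, last_health). fetch, sleep and clock are
    injectable so the loop can be exercised without a live cluster.
    """
    deadline = clock() + timeout_s
    while True:
        health = fetch()
        if health["status"] == "green":
            return True, health
        if clock() >= deadline:
            return False, health
        sleep(interval_s)


# Usage against a live cluster (address taken from the curl calls above):
#   ok, last = wait_for_green(lambda: fetch_health("http://10.96.250.211:10200"))
#   print(ok, last["status"], last["initializing_shards"], last["unassigned_shards"])
```

If the call returns `False`, the last health document records how many shards were still initializing or unassigned when the deadline passed.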

-Regards-
-Dong Aihua-

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. These are my test steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
    The detailed logs of the 20 nodes are attached.
    Thank you.
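
The health numbers in these steps are consistent with 15 primary shards and 1 replica per index. That is an inference from the counts (15 active primaries and 30 shards in total after the first index), not something stated in the thread. Under that assumption, the sketch below compares a health snapshot with the expected shard counts; the function name and its defaults are illustrative.

```python
def shard_accounting(health, indices, primaries_per_index=15, replicas=1):
    """Compare a _cluster/health snapshot with the expected shard counts.

    primaries_per_index and replicas are assumptions inferred from the
    numbers in this thread, not values reported by the cluster.
    """
    expected_primaries = indices * primaries_per_index
    expected_total = expected_primaries * (1 + replicas)
    # Every shard copy should be active, initializing, or unassigned.
    accounted = (health["active_shards"]
                 + health["initializing_shards"]
                 + health["unassigned_shards"])
    return {
        "expected_primaries": expected_primaries,
        "missing_primaries": expected_primaries - health["active_primary_shards"],
        "expected_total": expected_total,
        "not_active": expected_total - health["active_shards"],
        "all_accounted_for": accounted == expected_total,
    }


# Snapshot from step 4 above (four indices created so far).
step4 = {
    "active_primary_shards": 59,
    "active_shards": 112,
    "initializing_shards": 7,
    "unassigned_shards": 1,
}
report = shard_accounting(step4, indices=4)
```

For step 4 this reports one missing primary, which is what makes the status red, and eight shard copies that are not active (7 initializing plus 1 unassigned). For step 1 the same function reports no missing primaries and two inactive copies, which matches the yellow status.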

-Regards-
-Dong Aihua-

I tried to upload the logs several times. Every attempt failed with a 340
error. I will upload the logs later.

Hi, Simon
Please get the logs, es-test2-0-90-0-beta1_LOGS.tar.gz
(https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz),
from https://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

Hi,
Is there any clue about the root cause of this problem?
Thank you.

-Regards-
-Dong Aihua-

在 2013年3月5日星期二UTC+8下午2时42分55秒,Dong Aihua写道:

Hi, Simon
Please get the logs es-test2-0-90-0-beta1_LOGS.tar.gzhttps://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz
from https://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

在 2013年3月5日星期二UTC+9下午2时43分15秒,Dong Aihua写道:

I uploaded several times for the logs. All failed, I got the 340 error. I
will upload the logs later.

在 2013年3月5日星期二UTC+9下午2时21分42秒,Dong Aihua写道:

Hi, Simon:
I tested the ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened the again.
The configuration is same as before. The following are my test steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for long time
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for long time
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for long time.
    The detail logs of 20 nodes are attached.
    Thank you.

-Regards-
-Dong Aihua-

在 2013年3月4日星期一UTC+8下午4时47分52秒,simonw写道:

Hey,

thanks for testing this. I can see some exception during recovery when
starting a replica which causes one of the machines to wait for an answer
but it doesn't come back and it doesn't seem to get notified that the
connection is closed. Can you try to set: "indices.recovery.internal_action_timeout:
30s"
So we can see if this happens just because of the closed connections
here?

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I already opened an issue
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest the ES version 0.90.0.Beta1, it has the
same problem.
My test configure is as following:
20 nodes. 10.96.250.211,212,213 are master nodes. 211 is the leader;
214 and 215 are load balancer; other 15 nodes are data nodes.
My test step is as following:

  1. after the cluster is up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    The cluster stayed in this state for long time
  3. Then I created one more empty index:
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.
    You can refer to the detailed logs from all 20 nodes in the attachments.
    Thank you.
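
The health counters in the three steps above can be cross-checked with a little arithmetic. The following is only an illustrative sketch, not something posted in the thread: it assumes every index was created with 15 primary shards and 1 replica (inferred from the first snapshot; the actual index settings are not shown), and the helper name `shard_accounting` is hypothetical.

```python
# Health snapshots copied from the three steps above (only the counters used here).
SNAPSHOTS = [
    {"index_count": 1, "status": "yellow", "active_primary_shards": 15,
     "active_shards": 27, "initializing_shards": 3, "unassigned_shards": 0},
    {"index_count": 2, "status": "red", "active_primary_shards": 29,
     "active_shards": 53, "initializing_shards": 6, "unassigned_shards": 1},
    {"index_count": 3, "status": "red", "active_primary_shards": 43,
     "active_shards": 78, "initializing_shards": 6, "unassigned_shards": 6},
]

# Assumption: 15 primaries and 1 replica per index (not stated in the thread).
PRIMARIES_PER_INDEX = 15
REPLICAS = 1


def shard_accounting(snapshot):
    """Return (total_copies, expected_copies, missing_primaries) for one snapshot."""
    # Every shard copy reported here is active, initializing, or unassigned
    # (relocating_shards is 0 in all three snapshots).
    total = (snapshot["active_shards"]
             + snapshot["initializing_shards"]
             + snapshot["unassigned_shards"])
    expected_primaries = snapshot["index_count"] * PRIMARIES_PER_INDEX
    expected = expected_primaries * (1 + REPLICAS)
    missing = expected_primaries - snapshot["active_primary_shards"]
    return total, expected, missing


if __name__ == "__main__":
    for snap in SNAPSHOTS:
        total, expected, missing = shard_accounting(snap)
        print(snap["index_count"], snap["status"], total, expected, missing)
```

Under those assumptions every snapshot accounts for all expected shard copies (30, 60, and 90), and the status is red exactly in the snapshots where at least one primary is not active (0, 1, and 2 missing primaries respectively).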

-Regards-
-Dong Aihua-

On Wednesday, February 27, 2013 at 6:51:49 PM UTC+8, Clinton Gormley wrote:

On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

Hi, guys:
While testing over the past few days, I found bugs in ES 0.20.4 and 0.20.5
that cause shard allocation to fail and shards to get stuck in the
initializing state.
The following are my test steps:

  1. I set up 20 nodes with 0.20.4 and brought a fresh cluster up. 3
    nodes are master nodes, 2 nodes are load balancers, and 15 nodes are
    data nodes.
  2. After the cluster was up, I tried to create some empty indices,
    for example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc.
    But some shards were stuck in the initializing state for a long time.

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint


So, I looked closer at the latest logs and I see a lot of disconnects going
on, which gives me the impression you have some network issues. Nevertheless,
we pushed some changes to detect these situations earlier, but none of us was
able to reproduce your issue. The only thing I can ask you for is to try again
with the latest master to see if those commits help in any way.
What is your setup, by the way? Any idea why the servers disconnect all the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:

Hi,
Is there any clue about the root cause of this problem?
Thank you.

-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 at 2:42:55 PM UTC+8, Dong Aihua wrote:

Hi, Simon
Please get the logs es-test2-0-90-0-beta1_LOGS.tar.gz
(https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz)
from https://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

On Tuesday, March 5, 2013 at 2:43:15 PM UTC+9, Dong Aihua wrote:

I tried several times to upload the logs. Every attempt failed with a 340 error.
I will upload the logs later.

On Tuesday, March 5, 2013 at 2:21:42 PM UTC+9, Dong Aihua wrote:

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. The following are my test steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
    The detailed logs of the 20 nodes are attached.
    Thank you.
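
One way to tell "still recovering" from "stuck" in a run like the one above is to poll the health API a few times and compare the counters. This is a minimal sketch, assuming the host and port used in the steps above; the function names, poll count, and interval are hypothetical choices, and `wait_for_status` and `timeout` are optional query parameters of the cluster health API.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(host, port, wait_for_status=None, timeout=None):
    """Build a _cluster/health URL; both extra query parameters are optional."""
    params = {"pretty": "true"}
    if wait_for_status:
        params["wait_for_status"] = wait_for_status
    if timeout:
        params["timeout"] = timeout
    return "http://%s:%d/_cluster/health?%s" % (host, port, urlencode(params))


def fetch_health(url):
    """Fetch and decode one health response (needs a reachable node)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def is_stuck(samples):
    """True if the cluster is never green and the shard counters never moved."""
    if len(samples) < 2:
        return False
    keys = ("active_shards", "initializing_shards", "unassigned_shards")
    first = tuple(samples[0][k] for k in keys)
    unchanged = all(tuple(s[k] for k in keys) == first for s in samples)
    return unchanged and all(s["status"] != "green" for s in samples)


def watch(host, port, polls=5, interval_seconds=60):
    """Poll the health API a few times and report whether recovery stalled."""
    url = health_url(host, port)
    samples = []
    for _ in range(polls):
        samples.append(fetch_health(url))
        time.sleep(interval_seconds)
    return is_stuck(samples)
```

For example, `watch("10.96.250.211", 10200)` would return `True` for a cluster that keeps reporting the same yellow or red counters on every poll, which matches the "stayed in this state for a long time" observation.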

-Regards-
-Dong Aihua-

On Monday, March 4, 2013 at 4:47:52 PM UTC+8, simonw wrote:

Hey,

Thanks for testing this. I can see some exception during recovery when
starting a replica, which causes one of the machines to wait for an answer
that never comes back, and it doesn't seem to get notified that the
connection is closed. Can you try setting
"indices.recovery.internal_action_timeout: 30s"
so we can see if this happens just because of the closed connections
here?

simon


Hi Simon,

Which commits were they? 9a25867bfe154357165c87a7b509029ff832efa4? Curious
to see what has changed.

I have not looked at Dong's logs, but we have also experienced nodes being
removed from a cluster although the process is running. One possible
culprit is unresponsiveness due to GC.

--
Ivan


Hi, Simon:
The problem is that if I switch to version 0.19.11 with the same
environment and the same test steps, the cluster works fine.
In fact, we have two test clusters (this one, which has 20 nodes, and
another with 25 nodes). Both clusters have the same problem. With
0.19.11, both clusters work fine, but with 0.20.4, 0.20.5,
or 0.90.0.Beta1, both clusters have the same problem.
I also tried to reproduce the problem on 2 nodes running several
instances, but the problem didn't happen.
My test steps are very simple: just set up the cluster and try to create
1~3 empty indices.
In fact, this problem caused our system to crash after we upgraded ES from
0.19.11 to 0.20.4, and all the data was lost.
Thank you for your response.
@other people: by the way, has anyone else set up a large cluster with
0.20.4, 0.20.5, or 0.90.0.Beta1? Can you share your results?

-Regards-
-Dong Aihua-


By the way, the server disconnects at the end of the logs happened because I
ran a shutdown command after the test finished.


On Thursday, March 7, 2013 10:38:30 PM UTC+1, Ivan Brusic wrote:

Hi Simon,

Which commits were they? 9a25867bfe154357165c87a7b509029ff832efa4? Curious
to see what has changed.

Yeah, that is what I referred to:

simon

I have not looked at Dong's logs, but we have also experienced nodes being
removed from a cluster although the process is running. One possible
culprit is unresponsiveness due to GC.

--
Ivan

On Thu, Mar 7, 2013 at 3:04 AM, simonw <simon.w...@elasticsearch.com<javascript:>

wrote:

So, I looked closer at the latest logs and I see a lot of disconnects
going on giving me the impression you have some network issues.
Nevertheless we pushed some stuff to detect these situations earlier but
non of us was able to reproduce your issues. the only thing I can ask you
for is to try again with latest master to see if those commits helped in
any way?
What is your setup by the way, any idea why servers disconnect all the
time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:

Hi, :
Is there any clue for this problem's root cause?
Thank you.

-Regards-
-Dong Aihua-

在 2013年3月5日星期二UTC+8下午2时42分55秒,**Dong Aihua写道:

Hi, Simon
Please get the logs es-test2-0-90-0-beta1_**LOGS.tar.gzhttps://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz
from https://**github.com/dongaihua/shareshttps://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:

I tried several times to upload the logs. All attempts failed with a 340
error. I will upload the logs later.

On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. The following are my test
steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
    The detailed logs of the 20 nodes are attached.
    Thank you.

-Regards-
-Dong Aihua-

On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:

Hey,

Thanks for testing this. I can see some exception during recovery
when starting a replica, which causes one of the machines to wait for an
answer that never comes back, and it doesn't seem to get notified that
the connection is closed. Can you try to set
"indices.recovery.internal_action_timeout: 30s"
so we can see if this happens just because of the closed connections
here?

simon
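
A note for readers following along: a timeout can help here because a node that waits on a peer's reply with no deadline will wait forever if it never learns that the connection was closed. The Python sketch below only illustrates that general idea with a standard-library future. It is not Elasticsearch's recovery code, and `wait_for_reply` is a hypothetical helper name.

```python
import concurrent.futures


def wait_for_reply(reply, timeout_s):
    """Wait for a peer's reply, giving up after timeout_s seconds.

    `reply` is a concurrent.futures.Future standing in for a pending
    network response (an assumption made for illustration only).
    """
    try:
        return ("ok", reply.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # Without a timeout this call would block indefinitely when the
        # reply never arrives and no close notification is delivered.
        return ("timed_out", None)
```

With a deadline, the waiting side can fail the attempt and retry instead of leaving the shard in the initializing state.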

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I already opened an issue:
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest ES version, 0.90.0.Beta1, and it has
the same problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212, and 213 are master nodes; 211 is the
leader. 214 and 215 are load balancers; the other 15 nodes are data nodes.
My test steps are as follows:

  1. After the cluster was up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
  3. Then I created one more empty index: 1234abcd
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.
    You can find the detailed logs from the 20 nodes in the attachments.
    Thank you.

-Regards-
-Dong Aihua-
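
The manual steps above (create an empty index, then check `_cluster/health`) could be scripted roughly as follows. This is a minimal sketch, assuming the node address from the thread and Python's standard library only. The helper names (`http_json`, `pending_shards`, `is_stuck`, `create_and_watch`) and the polling parameters are hypothetical, and "stuck" is defined here simply as "pending shards never reached zero and did not decrease over the sampling window".

```python
import json
import time
import urllib.request

# Address of one node, as used in the thread; adjust for your own cluster.
BASE_URL = "http://10.96.250.211:10200"


def http_json(method, url, timeout_s=10):
    """Send a body-less HTTP request and decode the JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def pending_shards(health):
    """Shard copies that are not active yet, from a _cluster/health response."""
    return health["initializing_shards"] + health["unassigned_shards"]


def is_stuck(samples):
    """True if pending shards never reached zero and never decreased overall."""
    counts = [pending_shards(h) for h in samples]
    return len(counts) >= 2 and min(counts) > 0 and counts[-1] >= counts[0]


def create_and_watch(index_name, polls=12, interval_s=5.0, base_url=BASE_URL):
    """Create an empty index, then sample cluster health until nothing is pending."""
    http_json("PUT", f"{base_url}/{index_name}")
    samples = []
    for _ in range(polls):
        samples.append(http_json("GET", f"{base_url}/_cluster/health"))
        if pending_shards(samples[-1]) == 0:
            break
        time.sleep(interval_s)
    return samples
```

Against a test cluster, `is_stuck(create_and_watch("test1"))` would be expected to return `True` for the behaviour reported in this thread and `False` when all shards start promptly. The sampled health responses can be attached to a bug report as they are.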

On Wednesday, February 27, 2013 6:51:49 PM UTC+8, Clinton Gormley wrote:

On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

Hi, guys:
These days through the test, I found ES bugs in 0.20.4 and 0.20.5
which cause shards allocation failure and stuck in initializing state.
The following are my test steps:

  1. I setup 20 nodes with 0.20.4, and bring a fresh cluster up. 3 nodes
    are master nodes, 2 nodes are load balancer, 15 nodes are data nodes
  2. After the cluster is up, I tried to create some empty indices for
    example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc
    But some shards stuck in initializing status for long time.

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint

Thanks for the heads-up, that is what I figured. I am just wondering what
triggered the intermediate disconnects that cause the servers to wait on
the peers they recover from. Can you tell me a little about your setup? Are
they VMs? Do you have firewalls in place? Are the servers in different
datacenters? What is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:

By the way, the server disconnects at the end of the logs happened because I
ran a shutdown command after the test finished.

Hi, Simon:
Today I tested 0.19.12, and it works fine.
Then I tested 0.20.0, and the problem happened again. So I guess the problem
was introduced in 0.20.0.
The test steps are the same as before. The following are some logs:
{
"cluster_name" : "test-0.20.0",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 224,
"active_shards" : 445,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node:
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards:
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
The detailed logs are located at
https://github.com/dongaihua/shares/blob/master/test-0.20.0_LOGS.tar.gz
By the way, the servers were set up by the IT team, so I'm not clear about the
physical connections. If you really need that information, I can ask and get
back to you.
Thank you.

-Regards-
-dongaihua-
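
A summary like the `initShards` listing above can be derived from the routing table in a cluster state response. The sketch below is only an illustration: it assumes the state has the layout `routing_table -> indices -> <index> -> shards -> <shard number> -> [copies]`, with each copy carrying `state`, `primary`, and `node` fields, which may differ between versions. It is not the tool used to produce the output in this thread, and the function names are hypothetical.

```python
def non_started_shards(cluster_state):
    """List every shard copy whose routing state is not STARTED.

    Assumes routing_table -> indices -> <index> -> shards ->
    <shard number> -> [copies]; treat this layout as an assumption.
    """
    rows = []
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    for index_name in sorted(indices):
        shards = indices[index_name].get("shards", {})
        for shard_id in sorted(shards, key=int):
            for copy in shards[shard_id]:
                if copy.get("state") != "STARTED":
                    rows.append({
                        "index": index_name,
                        "shard": int(shard_id),
                        "primary": bool(copy.get("primary")),
                        "state": copy.get("state"),
                        "node": copy.get("node"),
                    })
    return rows


def unassigned_indices(rows):
    """Names of indices that have at least one UNASSIGNED shard copy."""
    return sorted({r["index"] for r in rows if r["state"] == "UNASSIGNED"})
```

Feeding the parsed JSON of a cluster state request into `non_started_shards` gives one row per initializing or unassigned copy, which makes it easier to see whether the same nodes keep showing up.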

Hi, Simon:
Today I tested the latest ES version, 0.20.6, and the problem seems to be
fixed in this version.
Can you confirm it? Also, I'm wondering how you fixed it.
Thanks a lot.

-Regards-
-dongaihua-

Hmm, that is interesting... I suspect that the fix for the missed
ChannelClosed events fixed it, then. I don't have your logs around
anymore, but did you see any TooManyOpenFiles exceptions by any chance? Can
you confirm that the latest master has fixed this as well by running a
master version / latest build?

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:

Hi, Simon:
Today I tested the latest ES version, 0.20.6, and the problem seems to be
fixed in this version.
Can you confirm it? I'm also wondering how you fixed it.
Thanks a lot.

-Regards-
-dongaihua-

On Monday, March 11, 2013 6:10:08 PM UTC+8, Dong Aihua wrote:

Hi, Simon:
Today I tested 0.19.12, and it is fine.
Then I tested 0.20.0, and the problem happened again. So I guess the
problem was introduced in 0.20.0.
The test steps are the same as before. The following are some logs:
{
"cluster_name" : "test-0.20.0",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 224,
"active_shards" : 445,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node:
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards:
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
The detailed logs are located at https://github.com/dongaihua/shares/
(test-0.20.0_LOGS.tar.gz: https://github.com/dongaihua/shares/blob/master/test-0.20.0_LOGS.tar.gz)
By the way, the servers were set up by the IT team, and I'm not clear about
the physical connections. If you really need that information, I can ask and
get back to you.
Thank you.

-Regards-
-dongaihua-
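
The initShards dump in the message above can be grouped by node to see where the stuck shards sit. A small sketch, with the four entries copied from the dump into a Python list (`INIT_SHARDS`, `per_node` and `stuck_primaries` are names chosen only for this illustration):

```python
import json
from collections import Counter

# Entries copied from the initShards dump above; "node" is kept as the
# JSON string it appears as in the dump.
INIT_SHARDS = [
    {"node": '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
     "primary": False, "shard": 5, "index": "testx-xxx-2012-03-11"},
    {"node": '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
     "primary": False, "shard": 11, "index": "test-2013-03-11"},
    {"node": '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
     "primary": True, "shard": 7, "index": "testx-xxx-2013-zzz"},
    {"node": '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
     "primary": False, "shard": 5, "index": "testx-xxx-2013-yyy"},
]

# Count initializing shards per node name.
per_node = Counter(json.loads(s["node"])["name"] for s in INIT_SHARDS)

# Indices whose primary copy is among the initializing shards.
stuck_primaries = [s["index"] for s in INIT_SHARDS if s["primary"]]

print(dict(per_node))
print(stuck_primaries)
```

All four initializing shards are on just two nodes, and the only initializing primary belongs to testx-xxx-2013-zzz, the same index that the dump lists under unassigned_indices.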

On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:

Thanks for the heads-up, that is what I figured. I am just wondering what
triggered the intermediate disconnects that cause the servers to wait on
the peers they recover from. Can you tell me a little about your setup? Are
they VMs? Do you have firewalls in place? Are the servers in different
datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:

By the way, the server disconnects at the end of the logs are there because
I ran a shutdown command after the test finished.

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:

So, I looked closer at the latest logs and I see a lot of disconnects
going on, giving me the impression you have some network issues.
Nevertheless we pushed some stuff to detect these situations earlier, but
none of us was able to reproduce your issues. The only thing I can ask you
for is to try again with latest master to see if those commits helped in
any way.
What is your setup by the way, any idea why servers disconnect all the
time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:

Hi,
Is there any clue about this problem's root cause?
Thank you.

-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 2:42:55 PM UTC+8, Dong Aihua wrote:

Hi, Simon
Please get the logs es-test2-0-90-0-beta1_LOGS.tar.gz
(https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz)
from https://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:

I tried to upload the logs several times. Every attempt failed with a 340
error. I will upload the logs later.

On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. The following are my test
steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
    The detailed logs of all 20 nodes are attached.
    Thank you.

-Regards-
-Dong Aihua-
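
The "stayed in this state for a long time" check in the steps above can also be done with a small polling loop instead of by hand. This is only a sketch: `wait_until_settled` and its parameters are made-up names, and `fetch_health` stands for any function that returns the parsed _cluster/health response as a dict (for example one that calls the curl URL used above), so nothing here is an Elasticsearch client API.

```python
import time


def wait_until_settled(fetch_health, timeout_s, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no shard is initializing or unassigned, or until timeout.

    Returns (settled, last_health).
    """
    deadline = clock() + timeout_s
    while True:
        health = fetch_health()
        if health["initializing_shards"] == 0 and health["unassigned_shards"] == 0:
            return True, health
        if clock() >= deadline:
            return False, health
        sleep(interval_s)


# Stand-in for a cluster that never recovers (numbers from step 4 above).
stuck = {"initializing_shards": 7, "unassigned_shards": 1}
settled, last = wait_until_settled(lambda: stuck, timeout_s=0.0, interval_s=0.0)
print(settled, last)
```

Passing the fetch function in keeps the loop independent of how the health response is obtained, so it can be tried out without a running cluster.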

On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:

Hey,

Thanks for testing this. I can see some exceptions during recovery
when starting a replica, which cause one of the machines to wait for an
answer that never comes back, and it doesn't seem to get notified that
the connection is closed. Can you try to set
"indices.recovery.internal_action_timeout: 30s"
so we can see if this happens just because of the closed
connections here?

simon
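
The situation described above, a machine waiting for an answer that never arrives, is the general reason for putting a timeout on the wait. The following is a generic illustration only, not Elasticsearch code: `wait_for_reply` is a made-up helper, and a local `queue.Queue` stands in for the connection to the peer.

```python
import queue


def wait_for_reply(replies, timeout_s):
    """Wait for one reply; return None instead of blocking forever."""
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        return None


# Nothing is ever put on this queue, like a peer that went silent.
replies = queue.Queue()
result = wait_for_reply(replies, timeout_s=0.05)
print("timed out" if result is None else result)
```

Without the timeout, `replies.get()` would block indefinitely, which is the kind of hang the 30s setting is meant to rule out or confirm.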

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I have already opened an issue:
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest ES version, 0.90.0.Beta1, and it has
the same problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212 and 213 are master nodes, and 211 is the
leader; 214 and 215 are load balancers; the other 15 nodes are data nodes.
My test steps are as follows:

  1. after the cluster is up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
  3. Then I created one more empty index:
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.
    You can refer to the detailed logs from all 20 nodes in the attachments.
    Thank you.

-Regards-
-Dong Aihua-


OK, I will test it and send you the results and logs.


Hi, Simon:
I tested the latest trunk code, and it also shows no problem. The logs are
located at git@github.com:dongaihua/shares.git
By the way, we did hit the TooManyOpenFiles problem once, but after we
changed ulimit -n, we have not seen that problem again.
After you check the logs, can you confirm whether the problem
is really resolved?
Thank you.
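
For reference, the limit that ulimit -n controls can also be read from a script. A minimal sketch using Python's standard `resource` module (Unix only); it reports the limit of the Python process itself, which is inherited from the shell that started it, so the Elasticsearch process may be running with a different value:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open files: soft={soft} hard={hard}")
```

A value of resource.RLIM_INFINITY means that no limit is set.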


Hi, Simon:
Today I successfully upgraded the ES cluster from 0.19.11 to 0.20.6. The
upgrade was smooth, and this problem has not appeared again.
Thank you.

在 2013年4月1日星期一UTC+8上午11时21分49秒,Dong Aihua写道:

Hi, Simon:
I tested the latest trunk code, it also has no problem. The logs are
located at git@github.com:dongaihua/shares.git
By the way, we ever met TooManyOpenFiles problem, but after we change
the ulimit -n , we don't meet that problems again.
After you check the logs, can you give me a confirmation if the problem
is really resolved?
Thank you.

在 2013年3月28日星期四UTC+8下午4时36分00秒,Dong Aihua写道:

Ok, I will test it and tell you the result and logs.

在 2013年3月28日星期四UTC+8下午4时07分17秒,simonw写道:

Hmm that is interesting... I suspect that the fix for the missed
ChannelClosed events fixed it then though....I don't have you logs around
anymore but did you see any TooManyOpenFiles exceptions by any chance? Can
you confirm that the latest master has fixed this as well by running a
master version / latest build?

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:

Hi, Simon:
Today I tested the latest ES version 0.20.6, the problem seems fixed
in this version.
Can you confirm it? And I'm wondering how do you fix it?
Thanks a lot.

-Regards-
-dongaihua-

在 2013年3月11日星期一UTC+8下午6时10分08秒,Dong Aihua写道:

Hi, Simon:
Today I tested 0.19.12, it is fine.
Then I tested 0.20.0, the problem happened again. So I guess the
problem is introduced since 0.20.0.
The test steps are as before. And the following are some logs:
{
"cluster_name" : "test-0.20.0",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 224,
"active_shards" : 445,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
master_node:
{ name: 'xseed021.kdev',
transport_address: 'inet[/10.96.250.211:9400]',
attributes: { data: 'false', master: 'true' } },
initShards:
[ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
primary: false,
shard: 5,
index: 'testx-xxx-2012-03-11' },
{ node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
primary: false,
shard: 11,
index: 'test-2013-03-11' },
{ node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
primary: true,
shard: 7,
index: 'testx-xxx-2013-zzz' },
{ node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
primary: false,
shard: 5,
index: 'testx-xxx-2013-yyy' } ],
unassigned_shards_total: 1,
unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
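
To make the numbers above easier to compare between runs, the health output can be reduced to a status plus a count of shards that are not yet active. The following Python sketch is only an illustration; `summarize_health` is a hypothetical helper name, and the sample is the health response quoted above.

```python
import json


def summarize_health(health):
    """Summarize a _cluster/health response dict.

    Returns (status, pending), where pending counts shards that are
    still initializing or unassigned, i.e. not yet active.
    """
    pending = health["initializing_shards"] + health["unassigned_shards"]
    return health["status"], pending


# Health response copied from the test-0.20.0 report above.
sample = json.loads("""
{
  "cluster_name" : "test-0.20.0",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 224,
  "active_shards" : 445,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1
}
""")

status, pending = summarize_health(sample)
print(status, pending)  # red 5
```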
The detailed logs are located at https://github.com/dongaihua/shares/
(test-0.20.0_LOGS.tar.gz: https://github.com/dongaihua/shares/blob/master/test-0.20.0_LOGS.tar.gz).
By the way, the servers were set up by the IT team, so I'm not clear about the
physical connections. If you really need that information, I can ask and
get back to you.
Thank you.

-Regards-
-dongaihua-

On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:

Thanks for the heads-up, that is what I figured. I am just wondering
what triggered the intermediate disconnects that cause the servers to wait
on the peers they recover from. Can you tell me a little about your setup?
Are they VMs? Do you have firewalls in place? Are the servers in different
datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:

By the way, regarding the server disconnects at the end of the logs: the
reason is that I ran a shutdown command after the test finished.

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:

So, I looked closer at the latest logs and I see a lot of
disconnects going on, which gives me the impression you have some network
issues. Nevertheless, we pushed some changes to detect these situations
earlier, but none of us was able to reproduce your issue. The only thing I
can ask you for is to try again with the latest master to see if those
commits helped in any way.
What is your setup, by the way? Any idea why the servers disconnect all
the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:

Hi, :
Is there any clue about the root cause of this problem?
Thank you.

-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 2:42:55 PM UTC+8, Dong Aihua wrote:

Hi, Simon
Please get the logs, es-test2-0-90-0-beta1_LOGS.tar.gz
(https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz),
from https://github.com/dongaihua/shares
Thank you.

-Regards-
-Dong aihua-

On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:

I tried to upload the logs several times. All attempts failed with a 340
error. I will upload the logs later.

On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:

Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting
indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
The configuration is the same as before. The following are my
test steps:

  1. curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 28,
    "relocating_shards" : 0,
    "initializing_shards" : 2,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":true}
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 30,
    "active_shards" : 57,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
  3. curl -XPUT 10.96.250.214:10200/abcd1234
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 45,
    "active_shards" : 87,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  4. curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
    {
    "cluster_name" : "es-test2-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 59,
    "active_shards" : 112,
    "relocating_shards" : 0,
    "initializing_shards" : 7,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
    The detailed logs of the 20 nodes are attached.
    Thank you.
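
The manual steps above (create an empty index, then watch the health output) could be scripted roughly as follows. This is only a sketch under stated assumptions, not what was actually run in the test: all function names are hypothetical, the timeout and polling interval are arbitrary, and only the two HTTP calls already shown in this thread (PUT of an index name, GET of `_cluster/health`) are used.

```python
import json
import time
import urllib.request


def create_index(base_url, name):
    """Create an empty index, like `curl -XPUT base_url/name`."""
    req = urllib.request.Request(base_url + "/" + name, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_health(base_url):
    """GET _cluster/health from a node and return the parsed JSON."""
    with urllib.request.urlopen(base_url + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_settled(get_health, timeout_s=60.0, interval_s=2.0,
                     sleep=time.sleep):
    """Poll until no shard is initializing or unassigned, or until timeout.

    get_health is any zero-argument callable returning a health dict.
    Returns (settled, last_health).
    """
    waited = 0.0
    while True:
        health = get_health()
        pending = health["initializing_shards"] + health["unassigned_shards"]
        if pending == 0:
            return True, health
        if waited >= timeout_s:
            return False, health
        sleep(interval_s)
        waited += interval_s


def reproduce(base_url, index_names):
    """Create each index in turn and report whether the cluster settled."""
    for name in index_names:
        create_index(base_url, name)
        settled, health = wait_for_settled(lambda: fetch_health(base_url))
        print(name, "settled" if settled else "STUCK", health["status"])
```

Against the cluster described here, it would be called as `reproduce("http://10.96.250.214:10200", ["test1", "1234abcd", "abcd1234", "xxxxxyyyyy111"])`. An index reported as STUCK corresponds to the "stayed in this state for a long time" observations in the steps.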

-Regards-
-Dong Aihua-

On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:

Hey,

thanks for testing this. I can see an exception during
recovery when starting a replica, which causes one of the machines to wait
for an answer that never comes back, and it doesn't seem to get notified
that the connection is closed. Can you try to set
"indices.recovery.internal_action_timeout: 30s"
so we can see if this happens just because of the closed
connections here?

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:

Hi, Clint:
I already opened an issue:
https://github.com/elasticsearch/elasticsearch/issues/2714
Today I tested the latest ES version, 0.90.0.Beta1, and it
has the same problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212 and 213 are master nodes, and 211 is
the leader; 214 and 215 are load balancers; the other 15 nodes are data nodes.
My test steps are as follows:

  1. after the cluster is up, I created an empty index: test1
    curl -XPUT 10.96.250.214:10200/test1
    {"ok":true,"acknowledged":true}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 15,
    "active_shards" : 27,
    "relocating_shards" : 0,
    "initializing_shards" : 3,
    "unassigned_shards" : 0
    }
    The cluster stayed in this state for a long time.
  2. Then I created another empty index: abcd1234
    curl -XPUT 10.96.250.214:10200/abcd1234
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 29,
    "active_shards" : 53,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 1
    }
    The cluster stayed in this state for a long time.
  3. Then I created one more empty index:
    curl -XPUT 10.96.250.214:10200/1234abcd
    {"ok":true,"acknowledged":false}
    curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
    {
    "cluster_name" : "es-test-0-90-0-beta1",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 20,
    "number_of_data_nodes" : 15,
    "active_primary_shards" : 43,
    "active_shards" : 78,
    "relocating_shards" : 0,
    "initializing_shards" : 6,
    "unassigned_shards" : 6
    }
    The cluster stayed in this state for a long time.
    You can find the detailed logs from the 20 nodes in the
    attachments.
    Thank you.

-Regards-
-Dong Aihua-

On Wednesday, February 27, 2013 6:51:49 PM UTC+8, Clinton Gormley wrote:

On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

Hi, guys:
These days through the test, I found ES bugs in 0.20.4 and 0.20.5
which cause shards allocation failure and stuck in initializing state.
The following are my test steps:

  1. I setup 20 nodes with 0.20.4, and bring a fresh cluster up. 3
    nodes are master nodes, 2 nodes are load balancer, 15 nodes are data
    nodes
  2. After the cluster is up, I tried to create some empty indices for
    example index-2013-02-25, index-2013-02-26, index-2013-02-27,
    index-2013-03-01, etc
    But some shards stuck in initializing status for long time.

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint
