Hi, Simon:
Today I tested 0.19.12 and it is fine. Then I tested 0.20.0 and the problem happened again, so I guess the problem was introduced in 0.20.0.
The test steps are the same as before. The following are some logs:
{
"cluster_name" : "test-0.20.0",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 224,
"active_shards" : 445,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
master_node:
{ name: 'xseed021.kdev',
transport_address: 'inet[/10.96.250.211:9400]',
attributes: { data: 'false', master: 'true' } },
initShards:
[ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
primary: false,
shard: 5,
index: 'testx-xxx-2012-03-11' },
{ node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
primary: false,
shard: 11,
index: 'test-2013-03-11' },
{ node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
primary: true,
shard: 7,
index: 'testx-xxx-2013-zzz' },
{ node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
primary: false,
shard: 5,
index: 'testx-xxx-2013-yyy' } ],
unassigned_shards_total: 1,
unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
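For reference, a similar per-shard breakdown can be pulled straight from the cluster APIs as well. This is only a sketch (the HTTP port 9200 on the master is an assumption, since the output above only shows the transport port 9400):

curl -XGET 'http://10.96.250.211:9200/_cluster/health?level=shards&pretty=true'
# per-index, per-shard state (INITIALIZING / UNASSIGNED / STARTED)
curl -XGET 'http://10.96.250.211:9200/_cluster/state?pretty=true'
# routing table, including which node each initializing shard sits on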
The detailed logs (test-0.20.0_LOGS.tar.gz) are located at https://github.com/dongaihua/shares/blob/master/test-0.20.0_LOGS.tar.gz
By the way, the servers were set up by the IT team, so I'm not clear about the physical connections. If you really need that information, I can ask and get back to you.
Thank you.
-Regards-
-dongaihua-
On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:
Thanks for the heads-up, that is what I figured. I am just wondering what triggered the intermittent disconnects that cause the servers to wait on the peers they recover from. Can you tell me a little about your setup? Are they VMs, and do you have firewalls in place? Are the servers in different datacenters, and what is the connection between them?
simon
On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:
By the way, regarding the server disconnects at the end of the logs: the reason is that I ran a shutdown command after the test finished.
On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:
So, I looked closer at the latest logs and I see a lot of disconnects going on, which gives me the impression you have some network issues. Nevertheless, we pushed some changes to detect these situations earlier, but none of us was able to reproduce your issue. The only thing I can ask is that you try again with the latest master to see if those commits helped in any way.
What is your setup, by the way? Any idea why the servers disconnect all the time?
simon
On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:
Hi:
Is there any clue about this problem's root cause?
Thank you.
-Regards-
-Dong Aihua-
On Tuesday, March 5, 2013 2:42:55 PM UTC+8, Dong Aihua wrote:
Hi, Simon
Please get the logs (es-test2-0-90-0-beta1_LOGS.tar.gz) from https://github.com/dongaihua/shares/blob/master/es-test2-0-90-0-beta1_LOGS.tar.gz
Thank you.
-Regards-
-Dong aihua-
On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:
I tried uploading the logs several times, but all attempts failed with a 340 error. I will upload the logs later.
On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:
Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting indices.recovery.internal_action_timeout: 30s.
The same problem happened again.
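For completeness, this is roughly how I applied the setting: in elasticsearch.yml on every node before restarting (a sketch only; the cluster name is the one from the health output, and the rest of my config is unchanged):

# elasticsearch.yml (excerpt)
cluster.name: es-test2-0-90-0-beta1
# give internal recovery actions a timeout instead of waiting forever on a dead connection
indices.recovery.internal_action_timeout: 30s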
The configuration is the same as before. The following are my test steps:
- curl -XPUT 10.96.250.214:10200/test1
{"ok":true,"acknowledged":true}
{
"cluster_name" : "es-test2-0-90-0-beta1",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 15,
"active_shards" : 28,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 0
}
The cluster stayed in this state for a long time.
- curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":true}
{
"cluster_name" : "es-test2-0-90-0-beta1",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 30,
"active_shards" : 57,
"relocating_shards" : 0,
"initializing_shards" : 3,
"unassigned_shards" : 0
}
- curl -XPUT 10.96.250.214:10200/abcd1234
{
"cluster_name" : "es-test2-0-90-0-beta1",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 45,
"active_shards" : 87,
"relocating_shards" : 0,
"initializing_shards" : 3,
"unassigned_shards" : 0
}
The cluster stayed in this state for a long time.
- curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
{
"cluster_name" : "es-test2-0-90-0-beta1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 59,
"active_shards" : 112,
"relocating_shards" : 0,
"initializing_shards" : 7,
"unassigned_shards" : 1
}
The cluster stayed in this state for a long time.
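If it helps to quantify "a long time": a blocking health call along the following lines can be used to wait for recovery. This is only a sketch (not something taken from my logs), and the 120s timeout is arbitrary:

curl -XGET 'http://10.96.250.214:10200/_cluster/health?wait_for_status=green&timeout=120s&pretty=true'
# returns as soon as the cluster reaches green, or after 120s with "timed_out" : true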
The detailed logs of the 20 nodes are attached.
Thank you.
-Regards-
-Dong Aihua-
On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:
Hey,
thanks for testing this. I can see some exceptions during recovery when starting a replica, which cause one of the machines to wait for an answer that doesn't come back, and it doesn't seem to get notified that the connection is closed. Can you try to set "indices.recovery.internal_action_timeout: 30s" so we can see if this happens just because of the closed connections here?
simon
On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:
Hi, Clint:
I already opened an issue: "ES have bugs in 0.20.4 and 0.20.5 which cause creating some indices failure" (https://github.com/elastic/elasticsearch/issues/2714).
Today I tested the latest ES version, 0.90.0.Beta1, and it has the same problem.
My test configuration is as follows:
20 nodes. 10.96.250.211, 212, and 213 are master nodes (211 is the elected master); 214 and 215 are load balancers; the other 15 nodes are data nodes.
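For reference, the node roles are set through elasticsearch.yml. A minimal sketch of the three flavors (illustrative only, not copied verbatim from my servers):

# dedicated master nodes (211, 212, 213)
node.master: true
node.data: false

# load balancer / client nodes (214, 215)
node.master: false
node.data: false

# data nodes (the remaining 15)
node.master: false
node.data: true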
My test steps are as follows:
- After the cluster was up, I created an empty index: test1
curl -XPUT 10.96.250.214:10200/test1
{"ok":true,"acknowledged":true}
curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
{
"cluster_name" : "es-test-0-90-0-beta1",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 15,
"active_shards" : 27,
"relocating_shards" : 0,
"initializing_shards" : 3,
"unassigned_shards" : 0
}
The cluster stayed in this state for a long time.
- Then I created another empty index: abcd1234
curl -XPUT 10.96.250.214:10200/abcd1234
{"ok":true,"acknowledged":false}
curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
{
"cluster_name" : "es-test-0-90-0-beta1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 29,
"active_shards" : 53,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 1
The cluster stayed in this state for long time
- Then I created one more empty index:
curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":false}
curl -XGET 10.96.250.211:10200/_cluster/health?pretty=true
{
"cluster_name" : "es-test-0-90-0-beta1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 20,
"number_of_data_nodes" : 15,
"active_primary_shards" : 43,
"active_shards" : 78,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 6
}
The cluster stayed in this state for a long time.
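In case it is useful for debugging the stuck shards, the per-shard recovery state can also be inspected. This is only a sketch, assuming the indices status API with the recovery flag is available in 0.90.0.Beta1:

curl -XGET 'http://10.96.250.214:10200/1234abcd/_status?recovery=true&pretty=true'
# shows, per shard, the recovery stage and what is still pending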
You can refer to the detailed logs from the 20 nodes in the attachments.
Thank you.
-Regards-
-Dong Aihua-
On Wednesday, February 27, 2013 6:51:49 PM UTC+8, Clinton Gormley wrote:
On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:
Hi, guys:
These days, through testing, I found ES bugs in 0.20.4 and 0.20.5 which cause shard allocation failures and shards stuck in the initializing state.
The following are my test steps:
- I set up 20 nodes with 0.20.4 and brought a fresh cluster up. 3 nodes are master nodes, 2 nodes are load balancers, and 15 nodes are data nodes.
- After the cluster was up, I tried to create some empty indices, for example index-2013-02-25, index-2013-02-26, index-2013-02-27, index-2013-03-01, etc.
But some shards stuck in the initializing status for a long time.
Please can you open an issue on GitHub, with a full recreation of what you need to do to recreate this problem, plus all the logs from all of the nodes.
ta
Clint