Update: Calls to http://hostXX.local:9200/_cluster/health on each server
show that each server has found only itself:
{
  "cluster_name": "foo",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5
}
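The same check can be scripted so that all five nodes are queried in one go. This is a minimal sketch, not a definitive tool: the host list is an assumption that follows the hostXX.local pattern, and the expected node count of 5 comes from the cluster size described in this thread.

```python
import json
from urllib.request import urlopen

# Assumed host names, following the hostXX.local pattern used in this thread.
HOSTS = ["host%02d.local" % i for i in range(1, 6)]
EXPECTED_NODES = 5  # the cluster is meant to have five nodes


def sees_whole_cluster(health, expected_nodes=EXPECTED_NODES):
    """Return True if a _cluster/health response reports every expected node."""
    return health.get("number_of_nodes") == expected_nodes


def fetch_health(host, port=9200, timeout=5):
    """Fetch and decode /_cluster/health from a single node."""
    url = "http://%s:%d/_cluster/health" % (host, port)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def report(hosts=HOSTS):
    """Print one line per node saying how many nodes it can see."""
    for host in hosts:
        try:
            health = fetch_health(host)
        except (OSError, ValueError) as exc:
            # URLError and socket timeouts are OSError subclasses;
            # ValueError covers a body that is not valid JSON.
            print("%s: request failed (%s)" % (host, exc))
            continue
        verdict = "ok" if sees_whole_cluster(health) else "ISOLATED"
        print("%s: number_of_nodes=%s status=%s %s" % (
            host, health.get("number_of_nodes"), health.get("status"), verdict))
```

Calling `report()` from a machine that can reach all five hosts prints one line per node; with the responses shown above, every node would be flagged as ISOLATED.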
I've tried increasing the zen multicast timeout to 10s and that hasn't
helped.
Startup log from a node looks like:
[2012-12-04 15:35:52,029][INFO ][node ] [host01] {0.19.11}[3109]: initializing ...
[2012-12-04 15:35:52,034][INFO ][plugins ] [host01] loaded , sites
[2012-12-04 15:35:54,122][INFO ][node ] [host01] {0.19.11}[3109]: initialized
[2012-12-04 15:35:54,122][INFO ][node ] [host01] {0.19.11}[3109]: starting ...
[2012-12-04 15:35:54,244][INFO ][transport ] [host01] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.21.44.10:9300]}
[2012-12-04 15:36:04,269][INFO ][cluster.service ] [host01] new_master [host01][SduK1WDSQo-N0KFD6qdoLw][inet[/XXX.XX.XX.XX:9300]], reason: zen-disco-join (elected_as_master)
[2012-12-04 15:36:04,298][INFO ][discovery ] [host01] trace/SduK1WDSQo-N0KFD6qdoLw
[2012-12-04 15:36:04,327][INFO ][http ] [host01] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.21.44.10:9200]}
[2012-12-04 15:36:04,328][INFO ][node ] [host01] {0.19.11}[3109]: started
[2012-12-04 15:36:05,034][INFO ][gateway ] [host01] recovered [1] indices into cluster_state
Where else can I look for clues as to why they don't find each other? Is
there a command line way to test multicast?
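One way to test multicast independently of Elasticsearch is a small sender/listener pair built on plain UDP sockets. The following is a sketch under stated assumptions: the group 224.2.2.4 is the one the servers report membership of, while the port 54329 and the function names are hypothetical choices made for this test only.

```python
import socket
import struct

GROUP = "224.2.2.4"  # multicast group the servers report being members of
PORT = 54329         # hypothetical test port, picked only for this experiment


def membership_request(group, interface="0.0.0.0"):
    """Build the ip_mreq structure used to join a group on the default interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def listen(group=GROUP, port=PORT):
    """Join the group and print every datagram received, until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    while True:
        data, sender = sock.recvfrom(1024)
        print("received %r from %s" % (data, sender[0]))


def send(message, group=GROUP, port=PORT, ttl=3):
    """Send one datagram to the group; the TTL limits how many hops it may cross."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("b", ttl))
    sock.sendto(message.encode("utf-8"), (group, port))
    sock.close()
```

Running `listen()` on one box and `send("hello from host01")` on another should print the message on the listening side if multicast traffic is actually being forwarded between them. If nothing arrives, the switch or a firewall rule is a likely suspect rather than the Elasticsearch configuration.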
rob
On Tuesday, December 4, 2012 2:25:48 PM UTC, Rob Styles wrote:
I have a five-node cluster running 0.19.11 and have indexed a
large number of documents using 40 Java client processes (in Hadoop
Map/Reduce). Each of the five nodes in my cluster has log entries
recognizing the addition and removal of client nodes with names such as
Ryder and Cagliostro. So, adding the data all seems to have gone fine.
If I ask each node in turn for a count I get the following:
curl -XGET 'http://hostXX.local:9200/_count?pretty=true'
host01 - "count" : 430793,
host02 - "count" : 430805,
host03 - "count" : 431772,
host04 - "count" : 431217,
host05 - "count" : 431500,
Each node says it has 5 total and 5 successful shards.
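Nodes that belong to the same cluster should all report the same count, so the size of the disagreement is worth quantifying. A minimal sketch using the counts listed above; the helper names are illustrative only.

```python
# Counts reported by each node, copied from the _count calls above.
COUNTS = {
    "host01": 430793,
    "host02": 430805,
    "host03": 431772,
    "host04": 431217,
    "host05": 431500,
}


def group_by_count(counts):
    """Map each distinct count to the hosts that reported it."""
    groups = {}
    for host, count in counts.items():
        groups.setdefault(count, []).append(host)
    return groups


def spread(counts):
    """Difference between the largest and smallest reported count."""
    values = list(counts.values())
    return max(values) - min(values)


groups = group_by_count(COUNTS)
print("distinct counts: %d" % len(groups))
print("spread: %d" % spread(COUNTS))
```

More than one distinct count means the nodes are not answering from the same index, which fits each node holding its own separate copy of the data.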
A call to curl -XGET 'http://hostXX.local:9200/_cluster/state?pretty=true'
returns a cluster state that lists all five shards local on each server and
doesn't list any other servers. Calls to
http://hostXX.local:9200/_cluster/nodes/_all?pretty=true list only the
server to which the request is sent. I was expecting to see all of them.
All of the servers are running with the same cluster.name and all have
specific host.name assignments in the config too.
Calls to each server using curl -XGET
'http://hostXX.local:9200/_search?q=foo&pretty=true' give different
result counts and different results in the listings.
ifconfig eth0 shows MULTICAST is enabled and netstat -g on each server
shows that it is part of the 224.2.2.4 group. The network is not slow; it's
a physical cluster not VMs and not on EC2. Typical ping responses are
around 0.1ms between boxes.
What should I look at next?
rob