Searches on different nodes return different results

I have a five-node cluster running 0.19.11 and have indexed a large number
of documents using 40 Java client processes (in Hadoop Map/Reduce). Each of
the five nodes has log entries recording the addition and removal of client
nodes with names such as Ryder and Cagliostro, so indexing the data seems
to have gone fine.

If I ask each node in turn for a count I get the following:

curl -XGET 'http://hostXX.local:9200/_count?pretty=true'

host01 - "count" : 430793,
host02 - "count" : 430805,
host03 - "count" : 431772,
host04 - "count" : 431217,
host05 - "count" : 431500,

Each node says it has 5 total and 5 successful shards.

A call to curl -XGET 'http://hostXX.local:9200/_cluster/state?pretty=true'
returns a cluster state that lists all five shards as local to that server
and doesn't list any other servers. Calls to
http://hostXX.local:9200/_cluster/nodes/_all?pretty=true list only the
server to which the request is sent. I was expecting to see all of them.

All of the servers are running with the same cluster.name and all have
specific host.name assignments in the config too.

Calls to each server using curl -XGET
'http://hostXX.local:9200/_search?q=foo&pretty=true' give different result
counts and different results in the listings.

ifconfig eth0 shows MULTICAST is enabled and netstat -g on each server
shows that it is part of the 224.2.2.4 group. The network is not slow; it's
a physical cluster, not VMs and not on EC2. Typical ping responses are
around 0.1ms between boxes.

What should I look at next?

rob

--

Update: Calls to http://hostXX.local:9200/_cluster/health on each server
show that each server has found only itself:

{
  "cluster_name": "foo",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5
}
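
For anyone wanting to run the same check across every node in one go, here
is a minimal Python sketch. It assumes each node answers
/_cluster/health on port 9200 as above; the helper names (fetch_health,
isolated_hosts) and the command-line shape are made up for illustration and
are not part of any Elasticsearch client.

```python
import json
import sys
import urllib.request


def fetch_health(host, port=9200, timeout=5):
    """Fetch and decode /_cluster/health from a single node."""
    url = "http://%s:%d/_cluster/health" % (host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def isolated_hosts(health_by_host, expected_nodes):
    """Return the hosts whose own view of the cluster has too few nodes."""
    return sorted(
        host
        for host, health in health_by_host.items()
        if health.get("number_of_nodes", 0) < expected_nodes
    )


# Hypothetical usage: python check_health.py 5 host01.local host02.local ...
if __name__ == "__main__" and len(sys.argv) > 2 and sys.argv[1].isdigit():
    expected = int(sys.argv[1])
    health = {host: fetch_health(host) for host in sys.argv[2:]}
    for host in isolated_hosts(health, expected):
        seen = health[host].get("number_of_nodes", 0)
        print("%s sees %d of %d nodes" % (host, seen, expected))
```

With the output above, every host would be reported as seeing 1 of 5
nodes, which points at discovery rather than at indexing.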

I've tried increasing the zen multicast timeout to 10s and that hasn't
helped :(

Startup log from a node looks like:

[2012-12-04 15:35:52,029][INFO ][node ] [host01] {0.19.11}[3109]: initializing ...
[2012-12-04 15:35:52,034][INFO ][plugins ] [host01] loaded [], sites []
[2012-12-04 15:35:54,122][INFO ][node ] [host01] {0.19.11}[3109]: initialized
[2012-12-04 15:35:54,122][INFO ][node ] [host01] {0.19.11}[3109]: starting ...
[2012-12-04 15:35:54,244][INFO ][transport ] [host01] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.21.44.10:9300]}
[2012-12-04 15:36:04,269][INFO ][cluster.service ] [host01] new_master [host01][SduK1WDSQo-N0KFD6qdoLw][inet[/XXX.XX.XX.XX:9300]], reason: zen-disco-join (elected_as_master)
[2012-12-04 15:36:04,298][INFO ][discovery ] [host01] trace/SduK1WDSQo-N0KFD6qdoLw
[2012-12-04 15:36:04,327][INFO ][http ] [host01] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.21.44.10:9200]}
[2012-12-04 15:36:04,328][INFO ][node ] [host01] {0.19.11}[3109]: started
[2012-12-04 15:36:05,034][INFO ][gateway ] [host01] recovered [1] indices into cluster_state

Where else can I look for clues as to why they don't find each other? Is
there a command line way to test multicast?

rob

On Tuesday, December 4, 2012 2:25:48 PM UTC, Rob Styles wrote:

--

Resolved. As most of you must have been thinking, multicast was not
working; in this case it was blocked at the switch. :(

A useful tool for verifying multicast is omping:
http://linux.die.net/man/8/omping
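
If omping isn't to hand, a rough stand-in can be sketched with Python's
socket module. This is only an illustration, not a replacement for omping:
it uses the 224.2.2.4 group mentioned above, but the port (54329) is a
hypothetical test port picked to stay clear of Elasticsearch, and the
function names are made up for the example.

```python
import socket
import struct
import sys

GROUP = "224.2.2.4"  # multicast group mentioned in this thread
PORT = 54329         # hypothetical test port, not one Elasticsearch uses


def membership_request(group, interface="0.0.0.0"):
    """Pack the ip_mreq structure used to join `group` on `interface`."""
    return struct.pack(
        "4s4s", socket.inet_aton(group), socket.inet_aton(interface)
    )


def receive(group=GROUP, port=PORT, timeout=30.0):
    """Join the group and print datagrams until `timeout` seconds pass idle."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(
        socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group)
    )
    sock.settimeout(timeout)
    try:
        while True:
            data, (addr, _) = sock.recvfrom(1024)
            print("%s: %r" % (addr, data))
    except socket.timeout:
        print("no datagram received for %.0fs" % timeout)
    finally:
        sock.close()


def send(message, group=GROUP, port=PORT, ttl=1):
    """Send one datagram to the group; ttl=1 keeps it on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(
        socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl)
    )
    try:
        sock.sendto(message, (group, port))
    finally:
        sock.close()


# Hypothetical usage: "recv" on one box, then "send" on another.
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "recv":
        receive()
    elif sys.argv[1] == "send":
        send(socket.gethostname().encode("utf-8"))
```

Run it with recv on one box and send on another. If the receiver only ever
sees datagrams sent from its own host, multicast is probably being dropped
somewhere between the boxes, as it was at the switch here.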

rob

--