Searches on different nodes return different results

Rob_Styles · December 4, 2012, 2:25pm

I have a five node cluster configured with 0.19.11 and have indexed a large
number of documents using 40 java client procs (in Hadoop Map/Reduce). Each
of the five nodes in my cluster has log entries recognizing the addition
and removal of client nodes with names such as Ryder and Cagliostro. So,
adding the data all seems to have gone fine.

If I ask each node in turn for a count I get the following:

curl -XGET 'http://hostXX.local:9200/_count?pretty=true'

host01 - "count" : 430793,
host02 - "count" : 430805,
host03 - "count" : 431772,
host04 - "count" : 431217,
host05 - "count" : 431500,

Each node says it has 5 total and 5 successful shards.

A call to curl -XGET 'http://hostXX.local:9200/_cluster/state?pretty=true'
returns a cluster state that lists all five shards local on each server and
doesn't list any other servers. Calls to
http://hostXX.local:9200/_cluster/nodes/_all?pretty=true only lists the
server to which the request is sent. I was expecting to see all of them.

All of the servers are running with the same cluster.name and all have
specific host.name assignments in the config too.

Calls to each server using curl -XGET
'http://hostXX.local:9200/_search?q=foo&pretty=true' gives different result
counts and different results in the listings.

ifconfig eth0 shows MULTICAST is enabled and netstat -q on each server
shows that it is part of the 224.2.2.4 group. The network is not slow; it's
a physical cluster not VMs and not on EC2. Typical ping responses are
around 0.1ms between boxes.

What should I look at next?

rob

--

Rob_Styles · December 4, 2012, 3:45pm

Update: Calls to http://hostXX.local:9200/_cluster/health on each server
shows that each server has found only itself:

{

cluster_name: "foo",
status: "yellow",
timed_out: false,
number_of_nodes: 1,
number_of_data_nodes: 1,
active_primary_shards: 5,
active_shards: 5,
relocating_shards: 0,
initializing_shards: 0,
unassigned_shards: 5

}

I've tried increasing the zen multicast timeout to 10s and that hasn'
helped

Startup log from a node looks like:

[2012-12-04 15:35:52,029][INFO ][node ] [host01]
{0.19.11}[3109]: initializing ...
[2012-12-04 15:35:52,034][INFO ][plugins ] [host01] loaded
, sites
[2012-12-04 15:35:54,122][INFO ][node ] [host01]
{0.19.11}[3109]: initialized
[2012-12-04 15:35:54,122][INFO ][node ] [host01]
{0.19.11}[3109]: starting ...
[2012-12-04 15:35:54,244][INFO ][transport ] [host01]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/172.21.44.10:9300]}
[2012-12-04 15:36:04,269][INFO ][cluster.service ] [host01]
new_master [host01][SduK1WDSQo-N0KFD6qdoLw][inet[/XXX.XX.XX.XX:9300]],
reason: zen-disco-join (elected_as_master)
[2012-12-04 15:36:04,298][INFO ][discovery ] [host01]
trace/SduK1WDSQo-N0KFD6qdoLw
[2012-12-04 15:36:04,327][INFO ][http ] [host01]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/172.21.44.10:9200]}
[2012-12-04 15:36:04,328][INFO ][node ] [host01]
{0.19.11}[3109]: started
[2012-12-04 15:36:05,034][INFO ][gateway ] [host01]
recovered [1] indices into cluster_state

Where else can I look for clues as to why they don't find each other? Is
there a command line way to test multicast?

rob

On Tuesday, December 4, 2012 2:25:48 PM UTC, Rob Styles wrote:

I have a five node cluster configured with 0.19.11 and have indexed a
large number of documents using 40 java client procs (in Hadoop
Map/Reduce). Each of the five nodes in my cluster has log entries
recognizing the addition and removal of client nodes with names such as
Ryder and Cagliostro. So, adding the data all seems to have gone fine.

If I ask each node in turn for a count I get the following:

curl -XGET 'http://hostXX.local:9200/_count?pretty=true'

host01 - "count" : 430793,
host02 - "count" : 430805,
host03 - "count" : 431772,
host04 - "count" : 431217,
host05 - "count" : 431500,

Each node says it has 5 total and 5 successful shards.

A call to curl -XGET 'http://hostXX.local:9200/_cluster/state?pretty=true'
returns a cluster state that lists all five shards local on each server and
doesn't list any other servers. Calls to
http://hostXX.local:9200/_cluster/nodes/_all?pretty=true only lists the
server to which the request is sent. I was expecting to see all of them.

All of the servers are running with the same cluster.name and all have
specific host.name assignments in the config too.

Calls to each server using curl -XGET '
http://hostXX.local:9200/_search?q=foo&pretty=true' gives different
result counts and different results in the listings.

ifconfig eth0 shows MULTICAST is enabled and netstat -q on each server
shows that it is part of the 224.2.2.4 group. The network is not slow; it's
a physical cluster not VMs and not on EC2. Typical ping responses are
around 0.1ms between boxes.

What should I look at next?

rob

--

Rob_Styles · December 4, 2012, 4:27pm

Resolved - as most of you must have been thinking, multicast was not
working. In this case it was blocked at the switch.

Useful tool for verifying multicast:
omping http://linux.die.net/man/8/omping

rob

On Tuesday, December 4, 2012 2:25:48 PM UTC, Rob Styles wrote:

I have a five node cluster configured with 0.19.11 and have indexed a
large number of documents using 40 java client procs (in Hadoop
Map/Reduce). Each of the five nodes in my cluster has log entries
recognizing the addition and removal of client nodes with names such as
Ryder and Cagliostro. So, adding the data all seems to have gone fine.

If I ask each node in turn for a count I get the following:

curl -XGET 'http://hostXX.local:9200/_count?pretty=true'

host01 - "count" : 430793,
host02 - "count" : 430805,
host03 - "count" : 431772,
host04 - "count" : 431217,
host05 - "count" : 431500,

Each node says it has 5 total and 5 successful shards.

A call to curl -XGET 'http://hostXX.local:9200/_cluster/state?pretty=true'
returns a cluster state that lists all five shards local on each server and
doesn't list any other servers. Calls to
http://hostXX.local:9200/_cluster/nodes/_all?pretty=true only lists the
server to which the request is sent. I was expecting to see all of them.

All of the servers are running with the same cluster.name and all have
specific host.name assignments in the config too.

Calls to each server using curl -XGET '
http://hostXX.local:9200/_search?q=foo&pretty=true' gives different
result counts and different results in the listings.

ifconfig eth0 shows MULTICAST is enabled and netstat -q on each server
shows that it is part of the 224.2.2.4 group. The network is not slow; it's
a physical cluster not VMs and not on EC2. Typical ping responses are
around 0.1ms between boxes.

What should I look at next?

rob

--

Topic		Replies	Views
Inconsistent results (again) Elasticsearch	6	287	July 6, 2017
Different results on search with several nodes Elasticsearch	6	637	July 6, 2017
2 nodes in the cluster while it should be 1 Elasticsearch	5	1656	July 6, 2017
Two same nodes but different stats Elasticsearch	3	757	July 5, 2017
Inconsistency in search result counts in single node that I do not see with two nodes Elasticsearch	1	359	July 6, 2017

Searches on different nodes return different results

Related topics