ElasticSearch : observer: timeout notification from cluster service


#1

I have an Elasticsearch cluster with 3 master-data nodes, one dedicated client node, and a Logstash instance sending events to the cluster via the client node.

The client is not able to connect to the cluster, and I am seeing the errors below in its log:

[2015-10-24 00:18:29,657][DEBUG][action.admin.indices.create] [ESClient] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-10-24 00:18:30,743][DEBUG][action.admin.indices.create] [ESClient] no known master node, scheduling a retry

I have gone through this Stack Overflow answer, but it did not work for me. The Elasticsearch config on my master-data nodes looks like this:

cluster.name: elasticsearch
node.name: "ESMasterData1"
node.master: true
node.data: true
index.number_of_shards: 7
index.number_of_replicas: 1
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-master3:9300", "es-client:9300", "es-master2:9300", "es-master1:9300"]
cloud.aws.access_key: AK
cloud.aws.secret_key: J0

My client config looks like this:

cluster.name: elasticsearch
node.name: "ESClient"
node.master: false
node.data: false
index.number_of_shards: 7
index.number_of_replicas: 1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-master1:9300", "es-master2:9300", "es-master3:9300", "kibana:9300"]
bootstrap.mlockall: true
cloud.aws.access_key: AK
cloud.aws.secret_key: J0

The Logstash output looks like this:

elasticsearch {
  index => "j-%{env}-%{app}-%{iver}-%{[@metadata][app_log_time]}"
  cluster => "elasticsearch"
  host => "es-client"
  port => "9300"
  protocol => "transport"
}

I have tried the following things without luck:

  • JVM heap has been set to 30 GB on all the ES nodes.
  • mlockall is set to true on all the nodes.
  • Telnet works from the ES client node to the ES master-data nodes on port 9300.
  • I have also verified with iperf that TCP and UDP traffic is allowed between the client and master-data machines.
  • The three ES master-data nodes can talk to each other, and the cluster status is reported as green when queried via one of them, but the query fails with MasterNotFoundException when sent via the ES client machine.
  • None of the machines are in AWS.

Environment:-

  • ElasticSearch 1.7.1
  • OS - Debian 7

Can someone let me know what is going wrong, or how I can debug this?


(Luca Wintergerst) #2

I don't know the solution, but there are a few things you could try:

What is kibana:9300? Is this a node? This seems wrong.

Add the node itself to its own unicast hosts config.
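
A minimal sketch of what those two changes might look like on the client node, assuming that `kibana` is not an Elasticsearch node and that the client's own hostname is `es-client` (the name already used in the master-data nodes' host list):

```yaml
# elasticsearch.yml on the client node (sketch; hostnames taken from the configs in post #1)
discovery.zen.ping.multicast.enabled: false
# Assumption: "kibana" is not an Elasticsearch node, so it is dropped,
# and the client's own address is listed in its place.
discovery.zen.ping.unicast.hosts: ["es-master1:9300", "es-master2:9300", "es-master3:9300", "es-client:9300"]
```

If `kibana` does turn out to run an Elasticsearch node, it would stay in the list and `es-client:9300` would simply be added alongside it.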


(Christian Dahlqvist) #3

What does the cluster health look like? Do you by any chance have a very large number of shards? You default to 7 shards and 1 replica for every index, and your Logstash configuration generates index names from a significant number of parameters.


#4
{
  "cluster_name": "elasticsearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 4,
  "number_of_data_nodes": 3,
  "active_primary_shards": 43,
  "active_shards": 86,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0
}

Stopping all nodes in the cluster and then starting them one by one solves the problem for a while, but then it recurs.


(Christian Dahlqvist) #5

That is a very manageable number of shards and all looks good. What have you got minimum master nodes set to? Are there any error messages in the logs apart from what you listed?


#6

I see the error below:

[2015-10-26 01:17:23,846][INFO ][discovery.zen            ] [ESMasterData1] failed to send join request to master [[ESMasterData2][HqgkEYtdTwS4Q6SnxGFh4g][es-master2][inet[/172.16.84.218:9300]]{master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]] 

and sometimes I see a long GC warning:

[2015-10-26 04:14:32,355][WARN ][monitor.jvm              ] [ESMasterData1] [gc][old][8430][4] duration [53.8s], collections [1]/[54.3s], total [53.8s]/[54.3s], memory [24.2gb]->[23.8gb]/[29.9gb], all_pools {[young] [12.8mb]->[17.4mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [24.1gb]->[23.8gb]/[29.1gb]}

Minimum master nodes: 2.

Would moving to dedicated master nodes (rather than combined master-data nodes) help?


(Luca Wintergerst) #7

There is your problem!

Your data nodes are busy collecting garbage and therefore can't answer the join request in time (the default is 30s, I think).

Dedicated master nodes will solve this.

You don't need extra servers for this. You can run multiple instances on one server.
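
As a rough sketch of that two-instance layout, assuming Elasticsearch 1.x setting names: each server would run one master-only and one data-only instance, each with its own config file, ports, and data path. The node names `ESMaster1` and `ESData1`, the 9301/9201 ports, and the `path.data` directories are hypothetical placeholders, not values from this cluster.

```yaml
# Instance 1: dedicated master (sketch; name, ports, and path are hypothetical)
cluster.name: elasticsearch
node.name: "ESMaster1"
node.master: true
node.data: false
transport.tcp.port: 9300   # existing unicast entries keep pointing at a master-eligible node
http.port: 9200
path.data: /var/lib/elasticsearch/master
discovery.zen.minimum_master_nodes: 2   # quorum of 3 master-eligible nodes, as already set in this thread
---
# Instance 2: dedicated data node on the same server (separate config file)
cluster.name: elasticsearch
node.name: "ESData1"
node.master: false
node.data: true
transport.tcp.port: 9301
http.port: 9201
path.data: /var/lib/elasticsearch/data
```

Both instances would keep the existing `discovery.zen.ping.unicast.hosts` list. The master-only instance holds no shards, so it would normally need far less heap than the 30 GB data instance.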


#8

But the long GC warnings are very intermittent. How often are the join requests sent?

Is running multiple instances on a single server good practice? Also, I have a master-data node configuration, and that is where I sometimes see the long GC warning. How would a dedicated master instance and a dedicated data instance on a single server solve the issue?


(Luca Wintergerst) #9

While a GC is running, the node is 'dead'. It can't do anything.

A dedicated master node will have its own JVM and therefore won't be affected by the GC of the data JVM.

We are running two instances on one node. This is also recommended if you use servers with more than 64 GB of RAM.
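
On the timing question, here is a sketch of the zen fault-detection settings as documented for the 1.x discovery module. The values shown are believed to be the defaults, so treat them as assumptions and check them against your version. With these values a node is pinged every second and each ping waits up to 30 seconds, so a 53.8s old-generation pause like the one logged in post #6 outlasts a full ping timeout.

```yaml
# Zen fault detection (sketch; believed 1.x defaults, not tuning advice)
discovery.zen.fd.ping_interval: 1s   # how often a node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping response
discovery.zen.fd.ping_retries: 3     # failed pings before the node is considered gone
```

Raising these values would only hide long pauses; separating the master into its own JVM addresses the cause described above.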


#10

Thanks, Luca.

