Elasticsearch cluster with four nodes

My Java application distribution is as below:
Four physical servers have three JVMs each, so a total of 12 Java application instances are running. Each Java application writes two different log files, which are captured by Logstash and fed to Elasticsearch. Kibana displays the dashboard. When I run the application in only one JVM with a single instance of ELK, everything works fine.
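Logstash picks each pair of files up with a file input along these lines (the paths below are placeholders, not the real ones):

    input {
        file {
            path => ["/path/to/application.log", "/path/to/application-audit.log"]
        }
    }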

I am trying to set up ELK in a clustered configuration. For convenience of explanation and of referencing the log files, I am using the IPs of the four machines. The IPs are:
172.18.17.43 -- Elasticsearch client node
172.18.17.44 -- Elasticsearch data node1
172.18.17.45 -- Elasticsearch master node
172.18.17.46 -- Elasticsearch data node 2

Logstash is installed on each of the four machines, but points to Elasticsearch on the master node (172.18.17.45), so logstash.conf is the same on all four machines. Kibana is installed only on the machine running the Elasticsearch client node (172.18.17.43).

The start sequence of ELK is as below:
Start the Elasticsearch master, then the client node, then the data nodes. Logstash is started in the same sequence. Kibana is started last.

ELK starts correctly, logs get posted to the Kibana indexes, and the data gets parsed correctly. But after 5-10 minutes, the Elasticsearch master crashes. Sometimes the Kibana UI does not display anything. Any clue on what is wrong would be helpful.

Extract from the configuration files:

  1. Elasticsearch master (172.18.17.45) yml:

    cluster.name: npci
    node.name: "elasticsearch_master"
    node.master: true
    node.data: false
    network.publish_host: 172.18.17.45
    network.host: 172.18.17.45
    transport.tcp.port: 9300
    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]
    http.cors.enabled: true

  2. Elasticsearch data node 1 (172.18.17.44) yml:

    cluster.name: npci
    node.name: "elasticsearch_data1"
    node.master: false
    node.data: true
    index.number_of_shards: 5
    index.number_of_replicas: 1
    network.publish_host: 172.18.17.44
    network.host: 172.18.17.44
    transport.tcp.port: 9301
    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]
    http.cors.enabled: true

  3. Elasticsearch data node 2 (172.18.17.46) yml:

    cluster.name: npci
    node.name: "elasticsearch_data2"
    node.master: false
    node.data: true
    index.number_of_shards: 2
    index.number_of_replicas: 1
    network.publish_host: 172.18.17.46
    network.host: 172.18.17.46
    transport.tcp.port: 9303
    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]
    http.jsonp.enable: true

  4. Elasticsearch client node (172.18.17.43) yml:

    cluster.name: npci
    node.name: "elasticsearch_client"
    node.master: false
    node.data: false
    index.number_of_shards: 0
    index.number_of_replicas: 0
    network.publish_host: 172.18.17.43
    network.host: 172.18.17.43
    transport.tcp.port: 9302
    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]

The output in logstash.conf is as below:

    output {
        elasticsearch {
            host => "172.18.17.45"
            cluster => "npci"
        }
    }


What do the ES logs show?
Also, you shouldn't be indexing through your master node; use the client node instead.
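For example, keeping your existing output options but pointing at the client node instead:

    output {
        elasticsearch {
            host => "172.18.17.43"
            cluster => "npci"
        }
    }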

172.18.17.43 -- Elasticsearch client node
172.18.17.44 -- Elasticsearch data node1
172.18.17.45 -- Elasticsearch master node
172.18.17.46 -- Elasticsearch data node 2

Side note: you're probably siloing your nodes prematurely. With this setup your master node is a single point of failure for the whole cluster, and I suspect you don't have the query load to warrant a separate client node.

Other comments:

  • Why have different port settings in transport.tcp.port?
  • There are a couple of settings (e.g. index.number_of_shards) that differ between the nodes but most likely should be the same; see the sketch after this list. If you're maintaining these files by hand you're working too hard.
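For example, if you want those index-level settings at all, the same values (taking data node 1's) would go on every node:

    index.number_of_shards: 5
    index.number_of_replicas: 1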

The ES logs do not show any error messages.
@magnusbaeck: Are you suggesting some other setup to avoid the single point of failure at the master? I was thinking of keeping single instances of Elasticsearch, Logstash and Kibana, with multiple entries in logstash.conf for the log files from the different application nodes. These logs would be kept on a shared drive mounted on all the machines. Is that a good solution, considering a peak load of 5,000 log entries per second for four hours?
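(For scale: 5,000 entries/second sustained for four hours is 5,000 × 4 × 3,600 = 72,000,000 log entries per peak window.)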

For the TCP ports: as Elasticsearch kept crashing, I tried different TCP ports.
Are you suggesting not to touch elasticsearch.yml except for the following:

    cluster.name: npci
    node.name: "elasticsearch_client"
    node.master: false
    node.data: false
    discovery.zen.minimum_master_nodes: 1
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]

Are you suggesting some other setup to avoid the single point of failure at the master?

Have three or more master-eligible nodes. With the small number of nodes you have, you don't need dedicated master nodes, and probably not dedicated client nodes either.

I was thinking of keeping single instances of Elasticsearch, Logstash and Kibana, with multiple entries in logstash.conf for the log files from the different application nodes. These logs would be kept on a shared drive mounted on all the machines. Is that a good solution, considering a peak load of 5,000 log entries per second for four hours?

You mean having a single Logstash instance that reads log files from all machines via network-mounted file systems? That should work, but it is quite atypical.

For the TCP ports: as Elasticsearch kept crashing, I tried different TCP ports.
Are you suggesting not to touch elasticsearch.yml except for the following:

Don't make random configuration changes. Yes, the parameters you listed make up a reasonable minimum set.

If I do not keep any dedicated master or dedicated client, I think the only line I need to uncomment in elasticsearch.yml is:

    cluster.name: npci

I shall keep logstash.conf as:

    output {
        elasticsearch {
            host => "localhost"
        }
    }

I plan to run Kibana on only one node. Will the Elasticsearch data get replicated to all nodes in this configuration?
If it gets replicated to all nodes, kibana.yml can have

    elasticsearch_url: "http://localhost:9200"

Otherwise, please suggest which Elasticsearch node it should connect to.

Dear Magnus: can you please share your views on the configuration in my last post? Please let me know if I need to provide anything more.

We recommend the use of unicast over multicast: https://www.elastic.co/guide/en/elasticsearch/guide/current/_important_configuration_changes.html#_prefer_unicast_over_multicast

If I do not use a dedicated master node, do I still need to have the following?

    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]

If I do not keep any dedicated master or dedicated client, I think the only line I need to uncomment in elasticsearch.yml is:

    cluster.name: npci

I shall keep logstash.conf as:

    output {
        elasticsearch {
            host => "localhost"
        }
    }

If you change ES's cluster name the Logstash configuration needs to be adjusted accordingly, unless you use the HTTP protocol (which you don't, but should do).

I plan to run Kibana on only one node. Will the Elasticsearch data get replicated to all nodes in this configuration?

Actual replication of data depends on the replica count of each index. All data in a cluster is available through every node regardless of replication, so you can connect to any cluster node. However, if you insist on having a client node you should connect to that node; that's the point of a client node, that it relieves the data and master nodes from dealing directly with queries. But again, with a cluster of your size and the query load I expect you'll have, a client node is overkill.
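For example, since Kibana sits on the client node (172.18.17.43) in your setup, the elasticsearch_url you quoted works as-is; from any other machine it would be:

    elasticsearch_url: "http://172.18.17.43:9200"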

If I do not use a dedicated master node, do I still need to have the following?

    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]

Yes. Newly started nodes need to be able to locate at least one other cluster node. This is unrelated to which master nodes you choose to have.
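As an aside (common practice, not something strictly required here): listing more than one node makes discovery more robust, so that startup doesn't depend on 172.18.17.45 alone. Assuming the default transport port everywhere, that could look like:

    discovery.zen.ping.unicast.hosts: ["172.18.17.43:9300", "172.18.17.44:9300", "172.18.17.45:9300", "172.18.17.46:9300"]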

Thanks. Let me try with the following in each of the four elasticsearch.yml files:

    cluster.name: npci
    node.name: "elasticsearch_client"    # will use a different name on each of the four nodes
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"]

I will keep logstash.conf as below:

    output {
        elasticsearch {
            host => "172.18.17.45"    # a different IP in each of the four logstash.conf files
            cluster => "npci"
        }
    }

Where should I explicitly specify HTTP?

Where should I explicitly specify HTTP?

In the elasticsearch output in your Logstash configuration. See the documentation for details.
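For example (this assumes the Logstash 1.x-era elasticsearch output, where a protocol option chooses between node, transport, and http; in Logstash 2.0 the output speaks HTTP only and hosts replaces host):

    output {
        elasticsearch {
            host => "172.18.17.45"
            protocol => "http"
        }
    }

With protocol => "http" you can also drop the cluster option, since Logstash talks to port 9200 over HTTP instead of joining the cluster.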

Now the discovery.zen.ping.unicast.hosts entry above is present in all four elasticsearch.yml files.

Will it work for the Elasticsearch instance that I start first, given that no other Elasticsearch instance is available when the first instance starts?

With your current settings, yes, the cluster will work with just one node up. But that's not a good situation: you should set discovery.zen.minimum_master_nodes to 3 to avoid split-brain situations. Then at least three of the nodes need to be online for the cluster to work; in return, any single node can be shut down without affecting the cluster's availability.
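With four master-eligible nodes the usual majority formula is (master-eligible nodes / 2) + 1, i.e. (4 / 2) + 1 = 3, so each elasticsearch.yml would get:

    discovery.zen.minimum_master_nodes: 3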

discovery.zen.ping.unicast.hosts: ["172.18.17.45:9300"] -- do we need to give 9300 or 9200?

Also, with the configuration we discussed earlier (no dedicated Elasticsearch master), I am getting the following error while starting Logstash, even though I can ping the host 10.1.1.11 successfully:

    INFO: I/O exception (org.apache.http.conn.HttpHostConnectException) caught when processing request to {}->http://10.1.1.11:9200: Connect to 10.1.1.11:9200 [/10.1.1.11] failed: Connection refused
    Nov 04, 2015 5:23:37 PM org.apache.http.impl.execchain.RetryExec execute
    INFO: Retrying request to {}->http://10.1.1.11:9200
    Nov 04, 2015 5:23:37 PM org.apache.http.impl.execchain.RetryExec execute
    INFO: I/O exception (org.apache.http.conn.HttpHostConnectException) caught when processing request to {}->http://10.1.1.11:9200: Connect to 10.1.1.11:9200 [/10.1.1.11] failed: Connection refused

Cluster nodes talk to each other on port 9300, so give 9300 (9200 is the HTTP port). 9300 is also the default, so you can omit the port setting entirely.
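For example:

    discovery.zen.ping.unicast.hosts: ["172.18.17.45"]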