Elasticsearch cluster indexing

Hi,

I have a two-node ES cluster.
I need to index data on the cluster. Do I need to provide both IPs in my indexing script?
I am doing this in Python.

Thanks.

If you want your script to be fault-tolerant when one of the servers goes down, you need to point it at both servers. Otherwise, one of them is enough.

(The ES cluster itself obviously also needs to be resilient if one node goes down.)
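
As a rough illustration of pointing a script at both servers, here is a minimal sketch using only the Python standard library. The addresses in `HOSTS`, the index name `my-index`, and the helper names `http_send` and `index_with_failover` are placeholders I made up for the example; the only Elasticsearch-specific part is a plain HTTP POST to `/<index>/_doc`. A client library would normally handle this for you.

```python
import json
import urllib.error
import urllib.request

# Hypothetical addresses and index name; replace with your own.
HOSTS = ["http://10.0.0.1:9200", "http://10.0.0.2:9200"]
INDEX = "my-index"


def http_send(host, index, doc, timeout=5):
    """POST one document to <host>/<index>/_doc and return the parsed reply."""
    req = urllib.request.Request(
        f"{host}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def index_with_failover(hosts, index, doc, send=http_send):
    """Try each host in order; return (host, reply) from the first that answers."""
    last_error = None
    for host in hosts:
        try:
            return host, send(host, index, doc)
        except urllib.error.HTTPError:
            # The node answered with an error status, so it is up.
            # Retrying the same request on another node will not help.
            raise
        except OSError as exc:
            # Connection refused, timeout, DNS failure: try the next host.
            last_error = exc
    raise ConnectionError(f"no Elasticsearch node reachable: {last_error}")
```

Calling `index_with_failover(HOSTS, INDEX, {"title": "hello"})` would try the first address and fall back to the second only if the first cannot be reached.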

Hi Hitesh,

Your indexing script will work as long as it can connect to any available "client" node, that is any node with client mode enabled that routes queries to data nodes. So connecting to just one node would work fine.

I'd guess that your two nodes are set up with the default configuration, which enables all of the modes on each node: client, master and data.

Some people also balance connections with DNS round robin across all of their client nodes, that is, a single address that resolves to both of the Elasticsearch nodes. Your indexing script then connects to whichever node it gets when it starts up and does the DNS resolution. For high availability, some people set up a health check that detects when a node is no longer available and takes it out of DNS.

You could also spray indexing traffic at both of these nodes to balance the workload between them.
You can do this either by fronting them with a load balancer (such as an ELB or HAProxy) or, if you're tricky enough, by having your script start a separate set of threads for each node listed in its config, so that indexing work is spread across both of them.

You could run a copy of the index script for each node as well, if your use case allows for parallel indexing.
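
The thread-per-node idea could look roughly like the sketch below. It only shows how the work is divided; `send_batch` is a hypothetical function you would write yourself to push one batch of documents to one node, and the other names are made up for the example as well.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(docs, n):
    """Deal docs out into n lists, the way you would deal cards."""
    return [docs[i::n] for i in range(n)]


def index_in_parallel(hosts, docs, send_batch):
    """Run send_batch(host, batch) once per host, each call in its own thread."""
    batches = split_round_robin(list(docs), len(hosts))
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = [
            pool.submit(send_batch, host, batch)
            for host, batch in zip(hosts, batches)
        ]
        # result() re-raises any exception that occurred in the worker thread.
        return {host: f.result() for host, f in zip(hosts, futures)}
```

This assumes the documents can be indexed in any order. If one node fails, its batch fails with it, so a real script would still need retry logic on top.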

Let me know if this answers your question for you or if you want more info.

Thanks.

In that case, if I index data through only one IP and that node goes down, will Elasticsearch be able to fetch the data from the other node?

If you're connecting to both servers in your script, the script should retry or reconnect when indexing fails because a node went down. If it is designed right, the separate threads that are connected to the other node should continue indexing.

With a load balancer or round-robin DNS, this should be handled to some extent by a health check probe of some type. With DNS, your script will need to re-resolve the name on failure or reconnection to get a new IP to connect to.
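
Here is a minimal sketch of the re-resolve step, again with the standard library only. The function names are mine, and `es.example.internal` in the usage note is a made-up hostname. The point is simply that the name is looked up again before every attempt instead of reusing an IP cached at startup.

```python
import socket


def resolve_all(hostname, port):
    """Return the distinct IP addresses currently published for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    ips = []
    for info in infos:
        ip = info[4][0]  # sockaddr is the fifth field; its first item is the IP
        if ip not in ips:
            ips.append(ip)
    return ips


def connect_with_reresolve(hostname, port, attempts=3, timeout=5,
                           resolve=resolve_all,
                           connect=socket.create_connection):
    """Connect to hostname, looking the name up again before every attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            ips = resolve(hostname, port)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
            continue
        for ip in ips:
            try:
                return connect((ip, port), timeout)
            except OSError as exc:
                last_error = exc  # this IP is dead, try the next one
    raise ConnectionError(f"could not reach {hostname}:{port}: {last_error}")
```

Something like `connect_with_reresolve("es.example.internal", 9200)` would return an open socket to the first address that accepts the connection. A real script would probably pause between attempts so the health check has time to update DNS.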

Yes, that helped.

So in my use case, I am indexing the data through a single IP, and when fetching the data through the API I will also provide this same IP. Will ES fetch data from both nodes while querying, or will it only return data from the one IP that I provide in the API?

I want to provide only one IP, both for indexing and for retrieval. Is that good practice, and what are the consequences of this approach?

When querying, any node acting as a client node will route the request to the data nodes that hold the particular shards with the documents that need to be retrieved or indexed. Depending on what the query is and where the shards are allocated, this can be one or more nodes. (For example, a match-all query requests all docs, so it would hit both nodes if the shards of the index are spread across both nodes.)
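
To make the routing idea concrete, here is a toy model, not how Elasticsearch is actually implemented. The shard count and the `SHARD_TO_NODE` allocation are invented, replicas are ignored, and `zlib.crc32` is only a stand-in for the hash Elasticsearch applies to the routing value (the document ID by default).

```python
import zlib

NUM_PRIMARY_SHARDS = 4

# Hypothetical allocation: which node holds each primary shard.
SHARD_TO_NODE = {0: "node-1", 1: "node-2", 2: "node-1", 3: "node-2"}


def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from the document ID: hash it, then take the remainder."""
    # crc32 is a stand-in, not the hash function Elasticsearch really uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def nodes_for_get(doc_id):
    """A get or index by ID touches only the node holding that one shard."""
    return {SHARD_TO_NODE[shard_for(doc_id)]}


def nodes_for_search():
    """A search over the whole index fans out to every node holding a shard."""
    return set(SHARD_TO_NODE.values())
```

In this model `nodes_for_get("42")` always contains exactly one node, while `nodes_for_search()` contains both, whichever node received the request.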

Since you will have two nodes set up with defaults (client+data+master mode) it should work fine if you use either node for query/indexing.

When you need to scale your cluster, you have the option to use dedicated client nodes that only perform query/index routing and don't hold data and don't act as master eligible. You might also consider dedicated master nodes so that they don't have to worry about storing data and can be dedicated to managing cluster state.

You have the flexibility to scale as you need to fit your use case, but there are some known patterns that work quite well for scaling up.

In short, you should be fine to query either server. You can also look at load balancing methods for spreading traffic evenly and for high availability.