Proper cluster setup


(GX) #1

Hi All

I have been trying to get this setup working for some time but can't seem to
get it right. I have a problem with the terminology used and often get mixed
up about what is needed.

My configuration is as follows:

4 servers running 2 websites (beta and live versions of the same site),
clustered for high availability and load balancing.

My Elasticsearch setup is as follows:
a network mapped drive (NFS) maps to /my_cluster
Elasticsearch is installed there, in
/my_cluster/apps/elasticseach
I could never get ES to run with its data on a mapped drive, so I have the
following settings (these are the only non-default ones):

path.data: /mnt/sdb1/data/
path.logs: /my_cluster/logs/elasticsearch
node.master: true
node.data: true

However, with this configuration, when indexing documents not all nodes get
all the data. I was informed this is called 'split brain' and it was suggested
that I use 'minimum_master_nodes', which I set to 3. This worked, but after
doing some batch indexing one of the nodes kept timing out and would not
restart.
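
For reference, the value 3 is consistent with the usual majority rule for
minimum_master_nodes (more than half of the master-eligible nodes). A minimal
sketch of that arithmetic, assuming all 4 servers are master-eligible; the
function name is hypothetical and not part of any Elasticsearch API:

```python
def majority_quorum(master_eligible_nodes: int) -> int:
    """Smallest node count that is strictly more than half of the nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Assumption: every node in the cluster is master-eligible (node.master: true).
for n in (1, 2, 3, 4, 5):
    print(n, "master-eligible nodes -> quorum", majority_quorum(n))
```

With 4 nodes this gives 3, so two halves of a partitioned 2+2 cluster could
not both elect a master.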

What is the correct configuration for this setup?
Since all nodes are running from the same directory, if the
elasticsearch.yml is not identical for each node, where/how do I specify
which config file to use?

After much development and preparation to implement ES, I'm disappointed with
this hurdle and have lost some confidence in data integrity. Restarting a
node wiped all data from the other nodes in one of my tests. I know this is
due to misconfiguration, but it is a major concern.

Regards

GX


(Radu Gheorghe) #2

Hi,

I would just try to make a subdirectory for each node and start from there.

Quite off-topic: I think you'll end up better off if you keep data on the
local storage of each node and just do backups to the NFS share (disable
flush while you do that).


(Ivan Brusic) #3

One major item of information that you are missing is the number of
shards and replicas for your index. It is very likely that you do not
have a split brain scenario, but just that your index is simply
divided unevenly between your nodes. With the default settings of 5
shards and 1 replica, you will have 10 total shards divided among 4
nodes.
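
To make that arithmetic concrete, here is a rough sketch of how 5 primaries
plus 1 replica each (10 shard copies) could land on 4 nodes. The round-robin
placement and the node names are assumptions for illustration only;
Elasticsearch's real allocator makes its own decisions.

```python
from collections import defaultdict


def spread_shards(num_primaries, num_replicas, nodes):
    """Round-robin placement of shard copies across nodes (illustrative only)."""
    placement = defaultdict(list)
    slot = 0
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            role = "primary" if copy == 0 else "replica"
            node = nodes[slot % len(nodes)]
            placement[node].append((shard, role))
            slot += 1
    return dict(placement)


# Hypothetical node names; 5 shards and 1 replica are the defaults quoted above.
placement = spread_shards(5, 1, ["node1", "node2", "node3", "node4"])
for node, copies in placement.items():
    print(node, len(copies), "copies:", copies)
```

In this sketch two nodes hold 3 shard copies and two hold 2, and no node
holds a copy of every shard, which is why a single node's data directory
looks incomplete.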

Regarding the data path: are all nodes pointing to the exact same
mount point? That would cause an error. Each node should have its
own unique data path or use the shared FS gateway:

http://www.elasticsearch.org/guide/reference/modules/gateway/fs.html

--
Ivan


(GX) #4

Hi Ivan and Radu

Thanks for your replies

I think you both misunderstood: the data path for each node is a local folder.
I did not manage to get it working with an NFS path.
path.data: /mnt/sdb1/data/
This means each node has its own local data path and I will have 4 copies
of the data.

Ivan, I don't understand why the default settings (5 shards and 1 replica)
would make any sense. Does that mean that all nodes can query the same data
from all shards and replicas?
I did look at gateways in the past but decided to rather have a copy of the
data on each node for redundancy, as a self-backup system.

I am also looking into multicast; someone suggested it may not be enabled
on the network.

When using bigdesk, it shows the cluster has only 1 node.

Regards

GX


(GX) #5

Hi All

Thanks for the help. I set multicast off and added the host IPs; now bigdesk
shows 4 nodes, so it seems that was the problem...

Regards

GX


(phill) #6

On 7/24/2012 10:04 PM, GX wrote:

[...] Ivan, I don't understand why the default settings (5 shards and 1
replica) would make any sense. Does that mean that all nodes can query
the same data from all shards and replicas?

I have only recently started working with Elasticsearch myself, so I
can sympathize with the problem of terminology.
One of the phrases you used suggests you might not have all the terms
straightened out.
"does that mean that all nodes can query the same data..." seems not
exactly on the mark.

A node is one running instance of Elasticsearch (typically one per server).
Nodes are organized into clusters.
An Elasticsearch index is made of a set of shards, plus replicas of each
shard, running on a cluster of nodes.
When you create an ES index, the shards are distributed among the nodes
in the cluster.
Any replica shards are also distributed around the cluster.
Each shard is actually a separate Lucene index with its own terms,
documents, frequency information, etc.

Each shard contains only some of the documents in an index. Typically the
documents are balanced between all the different shards.

Which nodes have which shards, where the replicas are, and which shard
will get a document are all controllable within ES.

When you index a document, it ends up being routed to ONLY ONE shard
in the index and copied to that shard's replicas.
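
The routing step can be pictured as hashing the document ID and taking the
result modulo the number of primary shards. This is only an illustration:
zlib.crc32 stands in for whatever hash Elasticsearch uses internally, and
the function name and document IDs are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # the default shard count mentioned in this thread


def shard_for(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick exactly one primary shard for a document ID."""
    # crc32 is an assumption used for illustration, not the real ES hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


for doc_id in ("user-1", "user-2", "user-3"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

The same ID always maps to the same shard, so later gets and updates for
that document go to the shard that holds it.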

When you search an ES index by sending a query to a node, the query is
sent to all shards in the index.

Therefore nodes don't really "query the same data", but if you ask
one node it will consolidate all results from all shards in the same
index, including its own shards.
Sure, a nerdy technical distinction, but I think it is worth mentioning.
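
A toy model of that consolidation step, assuming each shard is just a list
of (score, doc_id) hits. The names are hypothetical and this is not the real
search path, only the general scatter-then-merge idea:

```python
import heapq


def search(shards, size=10):
    """Ask every shard for its top hits, then merge them on one node."""
    # Scatter: each shard returns its own best `size` hits.
    per_shard = [heapq.nlargest(size, hits) for hits in shards]
    # Gather: the node that received the request merges the partial results.
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return [doc_id for _score, doc_id in merged]


shards = [
    [(0.9, "a"), (0.2, "b")],
    [(0.7, "c")],
    [(0.5, "d"), (0.4, "e")],
]
print(search(shards, size=3))  # prints ['a', 'c', 'd']
```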

If someone who has been at this longer sees any flaws in my attempt to
describe the terms, please jump in.

I was bitten by the distributed nature of answering a query the very
first time I sent a query using the Java API. I built up a minimal search
request and asked for the first 10 of all documents
without any other settings. Against a Lucene index this would always be the
same documents. But in ES the results can be inconsistent, because
without some sorting or scoring, the results from any shard were as
good as any other, so ES just gave me the first 10 it found.
That certainly helped me to understand Cluster, Node, Index, Shard and
Documents in ES.
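
A simplified model of that effect, assuming the only source of variation is
the order in which shards answer (the function and document names are
hypothetical). Without a sort key the first hits depend on arrival order;
with an explicit sort they do not:

```python
import random


def first_n(shards, n, sort_key=None, rng=None):
    """Collect hits shard by shard, in whatever order the shards answer."""
    if rng is None:
        rng = random.Random()
    order = list(shards)
    rng.shuffle(order)  # simulate an arbitrary response order
    hits = [doc for shard in order for doc in shard]
    if sort_key is not None:
        hits.sort(key=sort_key)  # an explicit sort makes the result stable
    return hits[:n]


shards = [["d1", "d4"], ["d2", "d5"], ["d3", "d6"]]
print(first_n(shards, 3))                # may differ from run to run
print(first_n(shards, 3, sort_key=str))  # always ['d1', 'd2', 'd3']
```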

-Paul


(GX) #7

Thank you for the clarification Paul.

GX

