Please explain the flow of data?


(Josh Harrison) #1

I'm trying to build a basic understanding of how indexing and searching
works, hopefully someone can either point me to good resources or explain!
I'm trying to figure out what having multiple "coordinator" nodes as
defined in the elasticsearch.yml would do, and what having multiple "search
load balancer" nodes would do. Both in the context of indexing and
searching.
Is there a functional difference between a "coordinator" node and a "search
load balancer" node, beyond the fact that a "search load balancer" node
can't be elected master?

Say I have a 4 node cluster. There's a master only "coordinator" node, that
doesn't store data, named "master".
node.master: true
node.data: false

There are three data only nodes, "A", "B" and "C"
node.master: false
node.data: true

I have an index "test" with two shards and one replica. Primary shard 0
lives on A, primary shard 1 lives on C, replica shard 0 lives on B, replica
shard 1 lives on A.

I send the command
curl -XPOST http://master:9200/test/test -d '{"foo":"bar"}'

A connection is made to master, and the data is sent to master to be
indexed. Master randomly decides to place this document in shard 1, so it
gets sent to the primary shard 1 on C and replica shard 1 on A, right? This
is where routing can come in: I can say that the document really should go
to shard 0 because I said so.
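
For reference, a minimal sketch of how the target shard could be derived. As far as I can tell from the documentation, the rule is shard = hash(routing) % number_of_primary_shards, with the routing value defaulting to the document ID, so the choice is deterministic rather than random. The CRC32 hash, the function name and the sample IDs below are assumptions for illustration only; Elasticsearch uses its own hash function.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard deterministically from a routing value."""
    # CRC32 is a stand-in hash chosen for illustration only;
    # it is NOT the hash function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


NUM_PRIMARY_SHARDS = 2  # the "test" index has two primary shards

# Hypothetical document ID: by default the ID is the routing value.
default_shard = shard_for("doc-1", NUM_PRIMARY_SHARDS)

# Hypothetical custom routing value: documents sharing it share a shard.
custom_shard = shard_for("user-42", NUM_PRIMARY_SHARDS)

print(default_shard, custom_shard)
```

The same routing value always maps to the same shard, which is why a custom routing value (passed with the routing parameter on the request) can keep related documents together.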

So this is a fairly simple scenario, assuming I'm correct.

What benefit do I get to indexing when I add more "coordinator" nodes?
node.master: true
node.data: false

What about if I add "search load balancer" nodes?
node.master: false
node.data: false

How about on the searching side of things?
I send a search to master,
curl -XPOST http://master:9200/test/test/_search -d '{"query":{"match_all":{}}}'

Master sends these queries off to A, B and C, who each generate their own
results and return them to master. Each data node queries all the relevant
shards that are present locally and then combines those results for
delivery to master. Do only primary shards get queried, or are replica
shards queried too?
Master takes these combined results from all the relevant nodes and
combines them into the final query response.

Same questions:
What benefit do I get to searching when I add more nodes that are like
master?
node.master: true
node.data: false

What about if I add "search load balancer" nodes?
node.master: false
node.data: false

Is the only difference between a
node.master: true
node.data: false
and a
node.master: false
node.data: false
that the node is a candidate to be a master, should it be elected?
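
To keep the combinations straight, here is a small sketch that maps the two settings onto the roles discussed in this thread. The function name and the label for the master+data case are my own; "coordinator" and "search load balancer" are the names used in the elasticsearch.yml comments.

```python
def describe_node(master: bool, data: bool) -> str:
    """Name the role implied by node.master / node.data (labels are informal)."""
    if master and data:
        return "master-eligible data node"
    if master:
        return 'dedicated master ("coordinator")'
    if data:
        return "data-only node"
    return '"search load balancer" (no data, never master)'


for master in (True, False):
    for data in (True, False):
        flags = f"node.master: {str(master).lower()}, node.data: {str(data).lower()}"
        print(f"{flags} -> {describe_node(master, data)}")
```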

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eaff1d85-1e85-422d-bfba-9a0825ed5da9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

A couple of things:

  1. You should have n/2+1 masters in your cluster, where n = number of
    nodes. This helps prevent split brain situations and is best practice.
  2. Your master nodes can store data; this way you don't need to add more
    nodes to fulfil the above.

Your indexing scenario is correct.
For searching, replicas and primaries can both be queried.
For both - Adding more masters adds redundancy as per the first two points.
Adding more search nodes won't do much though other than reduce the load on
your masters (unless someone else can add anything I don't know :p).
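
To make the search side concrete, here is a rough sketch of the scatter/gather step using the shard layout from the question. It assumes one copy of each shard (primary or replica) is queried per request and that copies are rotated between requests; the round-robin rule, the function names and the sample hits are illustrative assumptions, not the actual Elasticsearch implementation.

```python
import heapq

# Shard layout from the question: shard -> nodes holding a copy (primary first).
SHARD_COPIES = {0: ["A", "B"], 1: ["C", "A"]}

# Hypothetical per-shard hits as (score, doc_id); all copies of a shard
# hold the same documents, so the hits do not depend on the copy chosen.
SHARD_HITS = {
    0: [(1.0, "doc-a"), (0.4, "doc-b")],
    1: [(0.9, "doc-c"), (0.7, "doc-d")],
}


def pick_copy(shard: int, request_no: int) -> str:
    """Rotate over the copies of a shard (simplified round robin)."""
    copies = SHARD_COPIES[shard]
    return copies[request_no % len(copies)]


def search(request_no: int, size: int):
    """Query one copy per shard, then merge the partial results by score."""
    partial = []
    for shard in SHARD_COPIES:
        node = pick_copy(shard, request_no)  # exactly one copy per shard
        partial.extend((score, doc_id, node) for score, doc_id in SHARD_HITS[shard])
    # The node that received the request keeps only the global top `size` hits.
    return heapq.nlargest(size, partial)


print(search(0, 3))  # first request: shard 0 on A, shard 1 on C
print(search(1, 3))  # next request: shard 0 on B, shard 1 on A
```

Whichever node receives the request does the merge in the last step, which is the work a non-data node takes off the data nodes.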

And for your final question, yes that is correct.

To give you an idea of practical application, we don't use search nodes but
have 3 non-data masters that handle all queries, and a bunch of data only
nodes for storing everything.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Josh Harrison) #3

Awesome, ok, thank you.
Is the logic behind not allowing storage on master nodes to both:
Take advantage of a system with limited storage resources
and
Have a dedicated results aggregator/search handler?

I can imagine that if I had a particularly badly written, gnarly search, trying
to deal with the results on a master while it is also querying data could be
bad.

So in a 16 node cluster you'd want to have 9 nodes allowed to be masters,
(n/2)+1?

Thanks again!
Josh



(Mark Walkom) #4

Yes you can leverage a master to be a search node in that way.

We have a 15 node cluster with 3 masters, and I'm thinking I'll add another 2
when we add a few more data nodes in the next few weeks. Essentially you
want an odd number of masters to ensure a quorum can be reached. But when
you start getting large clusters, i.e. tens of nodes, it doesn't make as much
sense to have n/2+1 masters.
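
The quorum arithmetic can be sketched as below. This assumes integer division and assumes the quorum is counted over master-eligible nodes only; the function names are made up for the example.

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes (n/2+1, integer division)."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (3, 4, 5):
    print(f"{n} masters: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 masters raises the quorum without tolerating any extra failure, which is the reason for preferring an odd number. The same formula gives 9 for the 16 node example.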

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com


