[parent-child] on shards/nodes number


(Davide Palmisano) #1

Dear all,

I have a 3-node cluster with 5 active shards. Each node uses about 12GB of
RAM.
The whole index contains approximately 230 million documents of two
different types.

The first type has roughly 100 million documents, and there is a
parent-child relationship between the two types. I didn't implement any
custom routing, so I can't say how the documents are allocated to shards.

As I expected, parent-child queries are slower than direct queries, but
for the moment I cannot change this indexing schema. As I understand it,
each parent-child query is sent from the gateway to every shard, and the
results are then merged and sorted.

So I was thinking: is it a good idea to reduce the number of nodes (to 2),
each with more memory available, and also to reduce the number of shards
(to 1 or 2 per node at most)?

The main idea here is to reduce the network overhead of moving data
between the shards and the gateway. Is that viable, or am I talking nonsense? :slight_smile:

Any hints would be appreciated.

Thank you in advance,

--
Davide Palmisano


http://twitter.com/dpalmisano

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Martijn Van Groningen) #2

A parent and its children always reside on the same shard, so the 'join' is
executed locally. P/c queries are slower because this shard-internal join
takes time. For example, at the shard level: the inner query is executed and
we keep track of the parent ids that have matched, and optionally also of
the scores per parent id. Then ES iterates over all the parent documents and
checks whether their ids were seen in the first-phase search.
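
The two phases described above can be sketched roughly as follows. This is only an illustration in plain Python, not the actual Elasticsearch code: the function name `has_child_join`, the `parent_id` field, the convention that `inner_query` returns 0 for a non-match, and the choice of keeping the highest child score per parent are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, Any]


def has_child_join(
    child_docs: Iterable[Doc],
    parent_docs: Iterable[Doc],
    inner_query: Callable[[Doc], float],
) -> List[Tuple[str, float]]:
    """Shard-local join: return (parent id, score) for parents with a matching child."""
    # Phase 1: run the inner query over the child documents on this shard and
    # remember which parent ids matched, plus a score per parent id.
    # (Keeping the highest child score is an assumption of this sketch.)
    matched: Dict[str, float] = {}
    for child in child_docs:
        score = inner_query(child)
        if score > 0:
            parent_id = child["parent_id"]
            matched[parent_id] = max(score, matched.get(parent_id, 0.0))

    # Phase 2: iterate over all parent documents on this shard and keep the
    # ones whose id was seen in phase 1.
    return [(p["id"], matched[p["id"]]) for p in parent_docs if p["id"] in matched]
```

Note that phase 2 walks every parent document on the shard, no matter how few parents actually matched, which is where the extra time goes.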

The information sent between shards for the reduce is the same as with
normal queries. I think reducing to 3 shards makes more sense, since the
search request can then simply be parallelised across the nodes.
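
To make the shard arithmetic concrete, here is a small sketch of how many shards each node ends up searching per request. It assumes the shards are primaries with no replicas and that they are spread as evenly as possible over the nodes; `shards_per_node` is a hypothetical helper, not an Elasticsearch API.

```python
from typing import List


def shards_per_node(num_shards: int, num_nodes: int) -> List[int]:
    """Spread shards over nodes as evenly as possible and return the count per node."""
    base, extra = divmod(num_shards, num_nodes)
    # The first `extra` nodes each take one shard more than the rest.
    return [base + 1 if i < extra else base for i in range(num_nodes)]


current = shards_per_node(5, 3)    # the cluster as described: 5 shards on 3 nodes
suggested = shards_per_node(3, 3)  # one shard per node
```

Under those assumptions, 5 shards on 3 nodes gives an uneven 2/2/1 split, while 3 shards gives each node exactly one shard to search.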

(In reply to Davide Palmisano's message #1 of 10 October 2013, 12:38.)

--
Kind regards,

Martijn van Groningen



(Davide Palmisano) #3

Dear Martijn,

thanks for your response.

> The information sent between shards for the reduce is the same as
> with normal queries. I think reducing to 3 shards makes more sense, since
> the search request can then simply be parallelised across the nodes.

Yeah, I think I'm going to benchmark the two configurations to see whether my
idea holds up, and I'll post the results here in case they help someone
else :slight_smile:
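
Something along these lines could serve as a starting point for the comparison. It is only a sketch: `run_query` is a hypothetical callable that sends one parent-child query to the cluster under test, and the warm-up and run counts are arbitrary defaults rather than recommendations.

```python
import math
import statistics
import time
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to run_query and summarise the latencies in milliseconds."""
    # Warm-up calls are executed but not timed, so caches are populated first.
    for _ in range(warmup):
        run_query()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Nearest-rank 95th percentile over the sorted timings.
    p95_index = max(0, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }
```

Running the same set of queries through `benchmark` against each configuration and comparing the medians and 95th percentiles should show whether fewer shards actually helps.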

thanks

(In reply to Martijn van Groningen's message of Thursday, October 10, 2013, 12:20:33 PM UTC+1.)

