Why is has_parent so slow, and can anything be done?

Hello,

I am writing an analytics application that makes heavy use of aggregations.

My situation seems suited to parent/child. I have relatively few parents
(hundreds) and a lot more children (tens of millions).

The has_parent query or filter provides an elegant way to perform the sort
of queries I want, but the problem is that they are very slow (several seconds)
compared to those that don't use them (hundreds of milliseconds).

If I generate the parent ids on the client side and then use them in a
terms filter on the "_parent" field, things seem to be significantly faster
(although still not ideal).
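
For readers comparing the two approaches described above, here is a minimal sketch of the request bodies, written as Python dicts in the 1.x "filtered" query style. The parent type "location", the field "region" and the ids are hypothetical, and the exact value format the "_parent" field accepts may differ between versions, so treat this as an illustration rather than the exact requests used in this thread.

```python
import json


def has_parent_search(parent_type, parent_query):
    """Select child documents whose parent matches parent_query (server-side join)."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "has_parent": {
                        "parent_type": parent_type,
                        "query": parent_query,
                    }
                },
            }
        }
    }


def client_side_join_search(parent_ids):
    """Select the same children, with the parent ids already resolved by the client."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                # Sorted only so the generated body is deterministic.
                "filter": {"terms": {"_parent": sorted(parent_ids)}},
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical parent type, field and ids, for illustration only.
    print(json.dumps(has_parent_search("location", {"term": {"region": "north"}}), indent=2))
    print(json.dumps(client_side_join_search({"loc-2", "loc-1"}), indent=2))
```

The second form needs an extra round trip to fetch the matching parent ids first, which is only practical because there are just a few hundred parents.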

The documentation I have read indicates that has_parent can be expected to
be slow, but most suggested mitigations seem to be about reducing memory
usage rather than speeding up queries.

I am loath to give up on a functionally elegant solution. Why is
has_parent so slow? Is there anything I could try to speed has_parent up?
Should scaling out to more nodes help in this situation?

cheers
Perryn

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAFps6aDDzFnSkQKr2aNVgqpM4Eu5YZHcDST%3D24g0A6ngOqCXEQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Further investigation shows that anything that makes use of _parent seems
to result in slow queries, be it has_parent, has_child or the 'children'
aggregation.

I should mention that I am using 1.4.4 - is this to be expected even with
the performance improvements made in recent releases?

Are you also adding or modifying documents while searching with a has_parent
or has_child query?
In that case it makes sense to enable eager global ordinals loading on the
_parent field.
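
As a rough sketch of what enabling eager global ordinals on the _parent field could look like, the snippet below builds a 1.x-style mapping body for the child type. The type names "reading" and "location" are hypothetical, and the "fielddata.loading" setting is written from memory of the 1.x documentation, so it should be checked against the docs for the version in use.

```python
import json


def child_mapping_with_eager_ordinals(child_type, parent_type):
    """Build a mapping body that asks for eager global ordinals on _parent."""
    return {
        child_type: {
            "_parent": {
                "type": parent_type,
                # Assumed 1.x setting name; verify for your version.
                "fielddata": {"loading": "eager_global_ordinals"},
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical child and parent type names.
    print(json.dumps(child_mapping_with_eager_ordinals("reading", "location"), indent=2))
```

The idea is to move the cost of building global ordinals from the first search after a refresh to the refresh itself, which matters mostly when documents are indexed while searches are running.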

Work is planned to improve the has_child / has_parent queries when they are
part of a bigger query (for example a bool query); see the GitHub issue
"Refactor parent/child to not be Lucene queries" (#8134, elastic/elasticsearch).

Are you using score_mode? That makes things more expensive, so if you don't
need it you can turn it off.

Scaling out by adding more nodes does help to improve the query time.

The has_parent / has_child queries come with a performance penalty. When you
design your documents, consider whether you can de-normalize your data so
that you don't need parent/child, which keeps your searches fast. However,
this is sometimes expensive, because documents tend to get large or the
number of documents to be updated makes simple updates from the application
expensive. In those cases parent/child should be considered.

--
Kind regards,

Martijn van Groningen

Hi Martijn,

Thanks very much for your help.

The final product will be indexing new documents at the same time as
querying, but thus far for my performance trials I am performing
queries/aggs only. I assume therefore that enabling eager global ordinals
would not help with the performance issues I am seeing. (as an aside, if I
do enable global ordinals by updating the mapping, do I need to re-index
everything for it to take effect?)

I was using a has_parent filter, so it was my understanding that score_mode
was irrelevant? (I did manage to get slightly better performance using a
has_parent query wrapped in a constant_score query - I will try score_mode)
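
A sketch of the variant mentioned above, a has_parent query with scoring switched off and wrapped in constant_score, might look like the following. The parent type and field are hypothetical, and "score_mode": "none" is how I understand the 1.x option to be spelled; later versions changed this parameter.

```python
import json


def unscored_has_parent(parent_type, parent_query):
    """has_parent query with scoring disabled, wrapped in constant_score."""
    return {
        "query": {
            "constant_score": {
                "query": {
                    "has_parent": {
                        "parent_type": parent_type,
                        "score_mode": "none",  # assumed 1.x spelling
                        "query": parent_query,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical parent type and field.
    print(json.dumps(unscored_has_parent("location", {"term": {"region": "north"}}), indent=2))
```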

In general, I think my situation is the perfect use case for parent/child.
I have a lot of child documents with immutable data, and far fewer parent
documents with changeable data. I want to be able to aggregate across the
child documents using buckets derived from fields on the parents, e.g. find
the average of 'reading' (child document) in each 'location' (parent
document).
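
To make the "average reading per location" example concrete, here is a minimal sketch of a request body using the 'children' aggregation, meant to be run against the parent documents. The field names "name" and "value" and the child type "reading" are hypothetical stand-ins, not names from the actual index.

```python
import json


def average_reading_per_location(location_field="name",
                                 reading_field="value",
                                 child_type="reading"):
    """Bucket parents by a field, then average a field over their children."""
    return {
        "size": 0,  # only the aggregation results are wanted
        "aggs": {
            "by_location": {
                "terms": {"field": location_field},
                "aggs": {
                    "readings": {
                        # Switch from parent documents to their children.
                        "children": {"type": child_type},
                        "aggs": {
                            "avg_reading": {"avg": {"field": reading_field}}
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(average_reading_per_location(), indent=2))
```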

Quite often, the 'location' is recorded incorrectly and needs to be
updated, which makes de-normalisation infeasible since all the child
documents would need to be updated (and there are millions).

I am finding though, that any use of the parent/child relationship
(has_parent, has_child, children aggregation) slows down results by an
order of magnitude over queries that only aggregate directly over the child
documents.

If this is to be expected, then I may have to resort to a client side join
approach coupled with 'filters' aggregations to provide bucketing. This
will be significantly more fiddly from a code perspective though, so I just
want to make sure I'm not missing something.
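
For comparison, the client-side join plus 'filters' aggregation approach could be sketched roughly as below: the client first resolves which parent ids belong to each bucket, then sends one named filter per bucket over the child documents. The bucket names, ids and the "value" field are hypothetical, and the format of values accepted by the "_parent" field may vary by version.

```python
import json


def bucket_by_parent_groups(groups, reading_field="value"):
    """groups maps a bucket name to the parent ids that belong in that bucket."""
    return {
        "size": 0,
        "aggs": {
            "by_location": {
                "filters": {
                    "filters": {
                        # Sorted only so the generated body is deterministic.
                        name: {"terms": {"_parent": sorted(ids)}}
                        for name, ids in groups.items()
                    }
                },
                "aggs": {"avg_reading": {"avg": {"field": reading_field}}},
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical buckets: parent ids grouped by a parent-side field value.
    groups = {"north": {"loc-3", "loc-1"}, "south": {"loc-2"}}
    print(json.dumps(bucket_by_parent_groups(groups), indent=2))
```

The fiddly part is keeping the id-to-bucket mapping fresh on the client whenever a parent's 'location' is corrected.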

cheers
Perryn
