Performance problems with has parent filter


(Lauri) #1

Hi,

I'm having performance problems with the has_parent filter.

The mapping for the child document is:
{
  "program": {
    "_parent": { "type": "series" },
    ...
  }
}

And for the parent document:
{
  "series": {
    ...
    "properties": {
      ...
      "subject": {
        "type": "object",
        "properties": {
          ...
          "_path": {
            "type": "object",
            "properties": {
              "id": { "type": "string", "analyzer": "path_analyzer" }
              ...
            }
          }
        }
      },
      ...
    }
  }
}

If I search documents of type program (the child) like this:
{
  "from": 0,
  "size": 25,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "has_parent": {
          "filter": {
            "terms": {
              "subject._path.id": [ "5-162" ]
            }
          },
          "parent_type": "series"
        }
      }
    }
  }
}

It consistently takes around 160 milliseconds to run and finds about 60k
documents.

If I search documents of type series (the parent) like this:
{
  "from": 0,
  "size": 25,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "terms": {
          "subject._path.id": [ "5-162" ]
        }
      }
    }
  }
}

It takes around 5 milliseconds and returns about 400 documents.

There are about 1.7M program documents and about 11k series documents in
total. The index is fully optimized and the cluster is not doing anything
else. The index has 3 shards and 1 replica of each shard. There are three
nodes in the cluster. The nodes have twice as much RAM as the size of the
index, and half of the RAM is assigned to Elasticsearch. The Elasticsearch
version is 1.0. Judging by the bigdesk plugin, there is more than enough
RAM, and I'm not seeing cache evictions or anything like that.

So it looks to me like something weird is going on, as the has_parent
filter runs more than 30 times slower than the same filter run directly
against the parent type. Is there anything I can do to make it faster?
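
A minimal sketch of how the two searches above could be built and compared. It only assembles the request bodies and reads the `took` value (milliseconds) that search responses report; the helper names are made up for illustration, and sending the requests to a cluster is left out.

```python
from statistics import median


def parent_terms_filter(path_ids):
    """Terms filter on the parent field used in both searches above."""
    return {"terms": {"subject._path.id": list(path_ids)}}


def filtered_search(filter_body, size=25):
    """Wrap a filter in the same filtered/match_all search body as above."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "filtered": {"query": {"match_all": {}}, "filter": filter_body}
        },
    }


def has_parent_search(path_ids, parent_type="series"):
    """Search on the child type, filtered by a condition on the parent."""
    return filtered_search(
        {
            "has_parent": {
                "filter": parent_terms_filter(path_ids),
                "parent_type": parent_type,
            }
        }
    )


def slowdown(child_responses, parent_responses):
    """Ratio of the median `took` values of two lists of search responses."""
    child = median(r["took"] for r in child_responses)
    parent = median(r["took"] for r in parent_responses)
    return child / parent


# Timings reported above: about 160 ms (child search) vs about 5 ms (parent search).
print(slowdown([{"took": 160}], [{"took": 5}]))  # prints 32.0
```

Using the median over several runs, rather than a single measurement, should make the comparison less sensitive to one-off outliers.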

Thanks,
Lauri

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e2c05acb-99e3-4f00-816f-7d1e33d7bbfa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Binh Ly-2) #2

From a brief look, this sounds normal to me. Remember that has_parent is a
"join", whereas your other query goes straight to a single type with no
join. The only thing I can think of: if you feel you have spare capacity
per node, try increasing the number of shards a bit (maybe to 6) and see if
it makes any difference.
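
To illustrate the "join" point, here is a toy in-memory model of a parent-to-child join. It is only a conceptual sketch under simplifying assumptions, not how Elasticsearch implements has_parent internally; the document ids and the `subject_path_ids` field are invented for the example.

```python
def has_parent_join(parents, children, parent_matches):
    """Conceptual two-phase join.

    Phase 1 collects the ids of parents that match the filter.
    Phase 2 checks every child's _parent against that id set.
    """
    matching_ids = {pid for pid, doc in parents.items() if parent_matches(doc)}
    return [cid for cid, child in children.items() if child["_parent"] in matching_ids]


# Hypothetical data: two series (parents) and three programs (children).
parents = {
    "s1": {"subject_path_ids": ["5", "5-162"]},
    "s2": {"subject_path_ids": ["7"]},
}
children = {
    "p1": {"_parent": "s1"},
    "p2": {"_parent": "s2"},
    "p3": {"_parent": "s1"},
}

hits = has_parent_join(parents, children, lambda doc: "5-162" in doc["subject_path_ids"])
print(hits)  # prints ['p1', 'p3']
```

In this model the second phase touches every child, so its cost grows with the size of the child type (about 1.7M documents in the question) rather than with the roughly 400 matching parents, which is one plausible reason a plain filter on the parent type is so much cheaper.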



(Karol Gwaj) #3

there is not that much you can really do here
parent/child queries tend to be very slow & eat a lot of heap space

i had a similar performance problem
in my case i had a 3-level relationship (parent/child/grandchild) and query
time was on average 10x slower for every level

so my suggestion would be to switch to nested documents + the update api
if your query time is more important than update time, that will be the way
to go
(in my case the query performance improvement was 100x)


http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html
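
A rough sketch of what the nested documents + update api approach could look like, assuming the programs are embedded in their series document under a hypothetical `programs` field with a hypothetical `title` property. The update body uses the partial-document (`doc`) form of the update API; as far as I know arrays in a partial document replace the stored array rather than being merged, so the helper rebuilds the full list on the client side.

```python
def nested_series_mapping():
    """Mapping sketch: programs stored as nested objects inside a series."""
    return {
        "series": {
            "properties": {
                "programs": {  # hypothetical field name
                    "type": "nested",
                    "properties": {"title": {"type": "string"}},
                }
            }
        }
    }


def replace_programs_update(programs):
    """Update API body (partial-document form) that sets the whole programs array."""
    return {"doc": {"programs": list(programs)}}


def add_program(series_source, program):
    """Read-modify-write: build the update body that appends one program.

    series_source is the current _source of the series document.
    """
    current = series_source.get("programs", [])
    return replace_programs_update(current + [program])


body = add_program({"programs": [{"title": "episode 1"}]}, {"title": "episode 2"})
print(body)
```

The trade-off matches the one described above: every change to a single program rewrites the whole series document, so updates get more expensive while queries no longer need a join.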

Regards,
Karol Gwaj

On Sunday, March 30, 2014 8:28:33 AM UTC+1, Lauri wrote:

Hi,

I'm having performance problems with has parent filter.



(Lauri) #4

Hi,

Thank you for your replies!

I was afraid that the answer would be something like that. I was just
amazed at how slow the has_parent filter is, as all my other queries take
just a few milliseconds to execute. I guess I have to figure out how to
denormalize my data. The problem is that the parents may be updated
frequently and can potentially have thousands of children.
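
One possible shape of that denormalization, sketched under several assumptions: the parent's path ids are copied onto each program under a hypothetical `series_subject_path_ids` field, the index is called `media` (also hypothetical), and the copies are refreshed with the bulk API's `update` action, passing `_parent` in the action metadata so each request is routed to the child's shard.

```python
import json


def propagate_to_children(index, parent_id, child_ids, subject_path_ids):
    """Build a bulk request body that copies a parent's path ids onto each child.

    Each child contributes two newline-delimited JSON lines:
    the action metadata and the partial document to merge.
    """
    lines = []
    for child_id in child_ids:
        action = {
            "update": {
                "_index": index,
                "_type": "program",
                "_id": child_id,
                "_parent": parent_id,  # assumed to be needed for routing
            }
        }
        partial = {"doc": {"series_subject_path_ids": list(subject_path_ids)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(partial))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = propagate_to_children("media", "s1", ["p1", "p2"], ["5", "5-162"])
print(bulk_body)
```

With this layout the child search becomes a plain terms filter on the copied field, at the price of one bulk update per parent change that fans out to all of its children.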

Best,
Lauri

On Tue, Apr 1, 2014 at 11:34 AM, Karol Gwaj karol@gwaj.me wrote:

there is not that much you can really do here
parent/child queries tend to be very slow & eat a lot of heap space

i had similar performance problem
in my case i had 3 level relationship (parent/child/grandchild) and query
time was in average x10 slower for every level

so my suggestion will be to switch to using nested documents + update api
if your query time is more important than update time, that will be the
way to go
(in my case query performance improvement was x100 times)

http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html

Regards,
Karol Gwaj


