Parent/child does not work on some data


(Serg) #1

Hi all!
I have run into a problem when searching with a parent/child relation
between documents.

I have the following mappings:
https://gist.github.com/3893320
And I'm running the following request:
{
  "timeout": 180000,
  "query": {
    "filtered": {
      "filter": {
        "and": {
          "filters": [
            {
              "term": {
                "company_id": "4d07f9c8775911968cab4a80"
              }
            },
            {
              "has_child": {
                "query": {
                  "filtered": {
                    "filter": {
                      "and": [
                        {
                          "term": {
                            "company_id": "4d07f9c8775911968cab4a80"
                          }
                        },
                        {
                          "term": {
                            "name": "twitter"
                          }
                        }
                      ]
                    },
                    "query": {
                      "match_all": {}
                    }
                  }
                },
                "type": "tag"
              }
            }
          ]
        }
      },
      "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "should": []
        }
      }
    }
  }
}

You can see the timeout option in the request.
When I had indexed about 2 GB of data (all data for "company_id":
"4d07f9c8775911968cab4a80"), this query worked fine and took about 50 ms.
But once I had indexed all the required data (about 35 GB, including
additional companies), executing this query made the cluster hang for a
few hours. There were no errors in the logs during this time. After a
while some shards went into a "not initialized" state, and the cluster
only became available again after restarting all nodes.

I suspect that the "has_child" query fails either on some specific
documents or on larger datasets.

ES version used: 0.19.10.
The cluster consists of 3 AWS m1.xlarge instances. ES runs with the following
JVM options:
-Xms14g -Xmx14g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-Delasticsearch -Des.foreground=yes -Djava.net.preferIPv4Stack=true

--


(Martijn Van Groningen) #2

Hi Serg,

That doesn't look good. Can you share with us your nodes stats
(http://localhost:9200/_nodes/stats?all)?
Also when the cluster hangs, can you query the hot threads api
(http://localhost:9200/_nodes/hot_threads)?
How many primary and replica shards have you configured for your index?

The has_child query (and others like has_parent and top_children) needs an
in-memory data structure to perform efficiently. The size of this data
structure depends on the number of unique parent ids and the number of
documents. It is possible that you need to increase the heap space
size for ES or increase the number of nodes.
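The data structure described above can be illustrated with a small sketch. This is hypothetical Python for illustration only, not ES's actual implementation: resolving a has_child query needs a map covering every child's parent id in the index, which is why its footprint grows with the number of unique parent ids and documents rather than with the size of any single query's result.

```python
# Hypothetical sketch of how a has_child lookup can be resolved with an
# in-memory child -> parent id map (not the actual ES implementation).

def build_parent_map(child_docs):
    """Map each child doc id to the id of its parent document.

    This map must cover *all* children in the index, so its size grows
    with the number of documents and unique parent ids.
    """
    return {doc["id"]: doc["parent"] for doc in child_docs}

def has_child_match(parent_docs, child_docs, child_predicate):
    """Return parent docs that have at least one child matching the predicate."""
    parent_map = build_parent_map(child_docs)
    matching_parents = {parent_map[c["id"]] for c in child_docs if child_predicate(c)}
    return [p for p in parent_docs if p["id"] in matching_parents]

# Toy data mirroring the thread's contact/tag mapping (hypothetical ids):
contacts = [{"id": "c1"}, {"id": "c2"}]
tags = [
    {"id": "t1", "parent": "c1", "name": "twitter"},
    {"id": "t2", "parent": "c2", "name": "email"},
]
hits = has_child_match(contacts, tags, lambda t: t["name"] == "twitter")
```

Even though only one tag matches here, the full child-to-parent map is still built, which is the memory cost Martijn refers to.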

Martijn

On 15 October 2012 18:21, Serg Pilipenko cloun.rules@gmail.com wrote:

I have the following mappings:
https://gist.github.com/3893320

--
Met vriendelijke groet,

Martijn van Groningen

--


(Serg) #3

That doesn't look good. Can you share with us your nodes stats
(http://localhost:9200/_nodes/stats?all)?
Also when the cluster hangs, can you query the hot threads api
(http://localhost:9200/_nodes/hot_threads)?
How many primary and replica shards you have configured for your index?

I'll provide these stats tomorrow.

The has_child (and others like has_parent and top_children) need an
in-memory data structure to perform efficiently. The size of this data
structure depends on the number of unique parent ids and the number of
documents. It is possible that you need to increase the heap space
size for ES or increase the number of nodes.

Does this implementation take into account filters from the main query and
the subquery?

--


(Martijn Van Groningen) #4

Does this implementation take into account filters from main query and
subquery?

No. The cached data structure is reused across search requests and
therefore doesn't take into account a search request's query and
filter.
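This behaviour can be sketched as a lazily loaded, query-independent cache. The following is a hypothetical Python illustration of the pattern, not ES's actual code: the structure is built once per index from all parent ids, and every subsequent request reuses it unchanged, regardless of its own filters.

```python
# Hypothetical sketch of a per-index cache shared by all has_child
# queries (illustration of the behaviour described above, not ES code).

class IdCache:
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._data = None

    def get(self):
        # Query-independent: whatever filters a request carries, every
        # request sees the same full id map.
        if self._data is None:
            self._data = self._load_fn()  # first use pays the full load cost
        return self._data

loads = []  # track how often the expensive load actually runs
cache = IdCache(lambda: loads.append(1) or {"t1": "c1", "t2": "c2"})
first = cache.get()   # triggers the (slow) load from disk
second = cache.get()  # reuses the same structure, no reload
```

This is also why the first query against a fresh index is slow while later ones are fast: only the first call pays the load cost.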

Martijn

--


(Serg) #5

That doesn't look good. Can you share with us your nodes stats
(http://localhost:9200/_nodes/stats?all)?

https://gist.github.com/3902994

Also when the cluster hangs, can you query the hot threads api
(http://localhost:9200/_nodes/hot_threads)?

for i in {1..250}; do
  curl -XGET 'http://127.0.0.1:9200/_nodes/hot_threads?interval=2000' >> hot_threads.txt
done
https://gist.github.com/3903012
https://gist.github.com/3903012

And there is still no response after a long time.

How many primary and replica shards you have configured for your index?

15 primary shards, 1 replica set

Are there any other possible approaches to implement a relation between
different docs without performing 2 requests and computing on the client
side?
The main problem for me is that adding/deleting a tag would cause
reindexing of thousands of contacts if I include the tag doc inside the contact doc.

--


(Martijn Van Groningen) #6

Hi Serg,

Did you run the nodes stats request before or after you started
testing your queries? It seems that only a fraction of the allocated heap
space is actually used.

I see that the m1.xlarge instance has 15GB of memory available. In
your case you allocated 14GB of that to ES's heap space. ES depends a
lot on the OS file system cache. Right now the OS has only 1GB left.
This can make any kind of query slow. Usually a healthy balance is 50%
of the available memory to ES and the other 50% to OS. I'd set the
ES_HEAP_SIZE to 7GB.
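The arithmetic behind this advice, worked through with the numbers from the thread (a rough sizing sketch, not a universal rule):

```python
# Memory split on an m1.xlarge, using the figures from the thread.
total_gb = 15           # memory available on an m1.xlarge instance
current_heap_gb = 14    # -Xms14g -Xmx14g as configured

os_cache_gb = total_gb - current_heap_gb  # only 1 GB left for the OS FS cache

# A healthier ~50/50 split between ES heap and OS file system cache:
suggested_heap_gb = int(total_gb * 0.5)   # ~7 GB for ES, ~8 GB for the OS
```

With only 1 GB left for the file system cache, almost every query has to hit disk, which can make any kind of query slow.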

From what I can see in the hot threads output, it is loading
the data structures used by the has_child query. During the first
has_child query execution on a fresh index, the data structure it
needs is loaded from disk into memory. This can make the first
execution of the has_child query slow. Subsequent has_child queries
should be much faster.

Martijn

On 17 October 2012 02:51, Serg Pilipenko cloun.rules@gmail.com wrote:

That doesn't look good. Can you share with us your nodes stats
(http://localhost:9200/_nodes/stats?all)?

https://gist.github.com/3902994

--

--
Met vriendelijke groet,

Martijn van Groningen

--

