You can see the timeout option in the request.
When I had indexed about 2 GB of data (all data for "company_id":
"4d07f9c8775911968cab4a80"), this query worked fine and took about 50 ms.
But once I indexed all the required data (about 35 GB, including additional
companies), running this query made the cluster hang for a few hours, with
no errors in the logs during that time. After a while some shards went into
a "not initialized" state, and the cluster only became available again after
restarting all nodes.
I suspect that the "has_child" query breaks down either on some specific
documents or on larger datasets.
ES version used: 0.19.10
The cluster consists of 3 AWS m1.xlarge instances. ES runs with the following
JVM options:
-Xms14g -Xmx14g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-Delasticsearch -Des.foreground=yes -Djava.net.preferIPv4Stack=true
The has_child query (and others like has_parent and top_children) needs an
in-memory data structure to perform efficiently. The size of this data
structure depends on the number of unique parent ids and the number of
documents, so it is possible that you need to increase the heap size for ES
or increase the number of nodes.
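For illustration, a minimal sketch of what such a request could look like
(the index, type, and field names here are assumptions, only the company_id
value is taken from the post): a has_child query combined with a company_id
filter and a top-level timeout.

curl -XPOST 'http://localhost:9200/contacts/contact/_search?pretty=true' -d '{
  "timeout": "30s",
  "query": {
    "filtered": {
      "query": {
        "has_child": {
          "type": "tag",
          "query": { "term": { "name": "customer" } }
        }
      },
      "filter": {
        "term": { "company_id": "4d07f9c8775911968cab4a80" }
      }
    }
  }
}'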
Does this implementation take the filters from the main query and the
subquery into account?
How many primary and replica shards have you configured for your index?
15 primary shards, 1 replica.
Are there any other possible approaches to implement a relation between
different docs without performing 2 requests and computing things on the
client side?
The main problem for me is that adding or deleting a tag would cause
reindexing of thousands of contacts if I embedded the tag doc in the contact
doc.
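For context, this is roughly the kind of parent/child setup implied by the
has_child query above (the type, field, and id names here are illustrative,
not the real mapping): tags live as separate child documents, so adding or
removing a tag only touches one small document instead of reindexing the
contact.

# the child type "tag" declares the contact type as its parent
curl -XPUT 'http://localhost:9200/contacts/tag/_mapping' -d '{
  "tag": {
    "_parent": { "type": "contact" }
  }
}'

# adding a tag indexes one small child document, routed to its parent
curl -XPUT 'http://localhost:9200/contacts/tag/tag-42?parent=contact-1' -d '{
  "name": "customer"
}'

# deleting the tag removes only that child document
curl -XDELETE 'http://localhost:9200/contacts/tag/tag-42?parent=contact-1'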
Did you run the nodes stats request before or after you started testing
your queries? It seems that only a fraction of the allocated heap space is
actually used.
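(For reference, one way to check heap usage, assuming a node reachable on
localhost:9200; on the 0.19.x line the nodes stats API lives under
_cluster/nodes/stats:

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?jvm=true&pretty=true'

The jvm section of the response shows how much of the heap each node is
actually using.)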
I see that the m1.xlarge instance has 15 GB of memory available, and in
your case you allocated 14 GB of that to ES's heap space. ES depends a lot
on the OS file system cache, and right now the OS has only 1 GB left, which
can make any kind of query slow. Usually a healthy balance is 50% of the
available memory for ES and the other 50% for the OS, so I'd set
ES_HEAP_SIZE to 7GB.
From what I can see in the hot threads output, it is loading the data
structures used by the has_child query. During the first has_child query
execution on a fresh index, the data structure it needs is loaded from disk
into memory. This can make the first execution of the has_child query slow;
subsequent has_child queries should be much faster.
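As a concrete sketch of that 50/50 split (exact paths depend on how ES is
installed): the standard startup script picks up ES_HEAP_SIZE and uses it
for both -Xms and -Xmx, so instead of the 14g settings above you would start
each node with something like:

# give roughly half of the m1.xlarge's 15GB to the ES heap,
# leaving the rest for the OS file system cache
export ES_HEAP_SIZE=7g
./bin/elasticsearch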