Performance penalty for has_child queries

Tim_J · February 23, 2012, 6:59pm

Hey folks,
I'm currently working with an ES index of roughly 52 million
documents. We index approximately 10-20 new docs per second. Each
document is broken into two pieces and indexed as a parent/child
pair. The child contains static content and is unlikely to ever be
updated. The parent fields are modified frequently which is why the
child content was separated, particularly as the original source for
the child documents is expensive to retrieve.

Documents are replicated across three nodes. No data is stored with
the exception of a unique id for each doc. Each node is allocated 8
GB of RAM and we occupy about 22 GB per node on disk. We use the
routing key to "shard" our data. There are approximately 130
different routing keys in use at the moment. Routing keys are also
used as conditions for all searches so they should be a quick filter.

First, does anyone have a sense of the penalty we're paying for
having this parent/child relationship? We're seeing some very long
query times particularly when we're actively writing to the nodes.
Sometimes a simple query with one condition on the parent and one in a
has_child can take 8+ minutes.
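
For concreteness, here is a minimal sketch of a query with that shape: one term condition on the parent plus one has_child clause. The field names, values, and child type name are hypothetical placeholders, not the actual mapping described in this post. The routing key would be passed separately as the `routing` parameter on the search request.

```python
import json


def build_has_child_search(parent_field, parent_value,
                           child_type, child_field, child_value):
    """Build a search body combining a parent term condition with has_child.

    All field and type names are supplied by the caller; the ones used
    below are made up for illustration.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Condition evaluated on the parent document itself.
                    {"term": {parent_field: parent_value}},
                    # Condition evaluated on children; matches their parents.
                    {
                        "has_child": {
                            "type": child_type,
                            "query": {"term": {child_field: child_value}},
                        }
                    },
                ]
            }
        }
    }


# Hypothetical names: "status", "doc_content", and "body" are assumptions.
body = build_has_child_search("status", "active",
                              "doc_content", "body", "elasticsearch")
print(json.dumps(body, indent=2))
```

The function only builds the request body; sending it to a cluster is left out so the sketch runs without one.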

I've noticed that when we're doing a lot of writes to the child
index in particular the times go up significantly. On the other hand
if we only write to the parent index this is much less of a problem.
Is this expected?

Finally, does anyone have any suggestions for tuning this
configuration or improving our queries?

Thanks,
-Tim

kimchy · February 26, 2012, 7:11pm

The main penalty that occurs when using parent/child mapping happens because the ids need to be loaded to memory in order to do an efficient join process between the child and the parent. This initial loading of the ids can be expensive, but subsequent requests will be fast, even when indexing data.

Tim_J · February 27, 2012, 3:20pm

Thanks Shay. So it sounds like there's a cache to manage the joins.
Is there anything that would cause the cache of ids to be cleared?
How is it managed? I'm just wondering if we're doing something silly
that would cause it to get nuked periodically.

Thanks,
-Tim

Nick_Hoffman · February 27, 2012, 5:46pm

On Sunday, 26 February 2012 14:11:50 UTC-5, kimchy wrote:

The main penalty that occurs when using parent/child mapping happens
because the ids need to be loaded to memory in order to do an efficient
join process between the child and the parent. This initial loading of the
ids can be expensive, but subsequent requests will be fast, even when
indexing data.

When you said that the IDs need to be loaded into memory, am I correct in
interpreting that as all of the parents' and children's IDs?

For example, if there are 10K parents and 20K children, 30K IDs will be
loaded into memory to perform the join?

Thanks, Shay.

Nick_Hoffman · February 27, 2012, 6:34pm

I'm in the same boat as Tim:

Product documents are expensive to generate, and change infrequently.
ProductHave documents (which track which users want each product) are
cheap to generate, and change frequently.

Thus, I was planning on using a parent-child relationship, where the
"product_have" type is a child of the "product" type. This would make it
cheap and trivial to update which users have a product.

However, if performance will go down the drain when there are millions of
documents, there must be a better solution. Any ideas?

Radu_Gheorghe1 · February 28, 2012, 8:49am

Hi Nick,

I'm having a similar problem, so I'll just share my thoughts here,
without expecting them to be "solutions".

When you have documents that are related to each other, there are
three options:

1. Use a "relational" database. But that would hurt scalability, of
   course.
2. Manage relations yourself. I'm thinking about holding different
   types of documents in different indices, and then building the
   "relations" within your application's logic.
3. Structure your data according to what your queries will look like. I
   got this idea from a book on Cassandra :D. For example, it might make
   sense to hold all the data in one document (non-nested), and update it
   with new data (e.g. new fields for new customers).
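
As a rough illustration of the "manage relations yourself" option, the sketch below joins two collections in application code. It assumes the product / product_have example from earlier in the thread; the field names (`id`, `product_id`, `user_id`) are hypothetical, and plain lists stand in for the results of two separate searches against two indices.

```python
def application_side_join(products, product_haves, user_id):
    """Return the products that user_id has, joined in application code.

    Step 1 collects product ids from the frequently changing records;
    step 2 filters the rarely changing product records by those ids.
    Against a real cluster each step would be its own search request.
    """
    wanted_ids = {
        have["product_id"]
        for have in product_haves
        if have["user_id"] == user_id
    }
    return [product for product in products if product["id"] in wanted_ids]


# Made-up sample data standing in for search hits.
products = [
    {"id": "p1", "name": "Kettle"},
    {"id": "p2", "name": "Toaster"},
    {"id": "p3", "name": "Blender"},
]
product_haves = [
    {"product_id": "p1", "user_id": "alice"},
    {"product_id": "p3", "user_id": "alice"},
    {"product_id": "p2", "user_id": "bob"},
]

for product in application_side_join(products, product_haves, "alice"):
    print(product["name"])
```

The trade-off is an extra round trip and a potentially large id list in the second request, in exchange for not relying on the parent/child id cache at all.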

haarts · February 28, 2012, 11:28am

Let me chime in as well.
Our problem is similar. We have about 700M items and add items at a rate
of 100/s. The performance we are seeing is not great. The has_child query
time depends on the number of newly indexed items (that makes sense). But
that time is already seconds(!) after adding a couple of thousand new
items, and many minutes if we leave it running for a while.

We are currently investigating whether it is possible to add the ids to the
internal memory map as soon as they are indexed.

Tim_J · February 28, 2012, 1:19pm

Hey haarts,
I'd be really interested to see what you come up with here. So from
the sound of it, until a document turns up in a query it's not added
to the memory map? Can you point me to the area of code where that
memory map is managed?

Thanks,
-Tim

haarts · February 28, 2012, 1:39pm

I'm currently looking for that area. I'll keep you posted.
In the meantime I'm also testing a hack:

    while true
      perform has_child query
      sleep 1
    end

Seems to work surprisingly well. But we are not going to leave that in
place of course.
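
A minimal Python sketch of the same keep-warm loop, for anyone who wants to try it. The helper names (`post_search`, `keep_warm`) and the URL in the usage comment are assumptions for illustration; only the idea of re-running a has_child query once a second comes from the hack above.

```python
import json
import time
import urllib.request


def post_search(url, body, timeout_s=30.0):
    """POST a JSON search body to url and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


def keep_warm(run_query, interval_s=1.0, max_iterations=None, sleep=time.sleep):
    """Call run_query repeatedly, pausing interval_s between calls.

    Returns the number of queries attempted. With max_iterations=None the
    loop never ends, matching the original "while true" hack.
    """
    attempted = 0
    while max_iterations is None or attempted < max_iterations:
        try:
            run_query()
        except OSError as error:
            # Network errors (including urllib's URLError) should not
            # stop the loop; log and try again on the next tick.
            print(f"warm-up query failed: {error}")
        attempted += 1
        sleep(interval_s)
    return attempted


# Hypothetical usage; the host, index name, and query body are placeholders:
# warm_body = {"query": {"has_child": {"type": "child_type",
#                                      "query": {"match_all": {}}}}}
# keep_warm(lambda: post_search("http://localhost:9200/myindex/_search",
#                               warm_body))
```

Passing the query as a callable keeps the loop testable without a running cluster.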

haarts · February 28, 2012, 3:59pm

Hi Tim,

Just an update.
We are encountering some unexpected behaviour when running the same
has_child query twice consecutively. The first query takes about 30 minutes
to complete, as does the second one. This is contrary to the belief that
once the IDs are loaded in memory, search should be fast. The index contains
36M documents and the index size is 30GB, running on an 8 core i7 with 24GB
RAM.

Regarding your question on when an ID is loaded to the memory map; I
believe they are all loaded all the time.
I believe the code responsible is in
java/org/elasticsearch/index/cache/id/simple/SimpleIdCache.java:
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/cache/id/simple/SimpleIdCache.java

Harm

On Tuesday, 28 February 2012 14:19:40 UTC+1, Tim J wrote:

Hey haarts,
I'd be really interested to see what you come up with here. So from
the sound of it, until a document turns up in a query it's not added
to the memory map? Can you point me to the area of code where that
memory map is managed?

Thanks,
-Tim

On Feb 28, 6:28 am, haarts harmaa...@gmail.com wrote:

Let me chime in as well.
Our problem is similar. We have about 700M items and add items at a
speed
of 100/s. The performance we are seeing is not great. The query time
required (has_child) is dependent on the amount on new items indexed
(that
makes sense). But that time is already seconds(!) after adding a couple
of
thousand new items. And many minutes if we leave it running for a while.

We are currently investigating whether it is possible to add the ids to
the
internal memory map as soon as they are indexed.

On Thursday, 23 February 2012 19:59:44 UTC+1, Tim J wrote:

Hey folks,
I'm currently working with an ES index of roughly 52 million
documents. We index approximately 10-20 new docs per second. Each
document is broken into two pieces and indexed as a parent/child
pair. The child contains static content and is unlikely to ever be
updated. The parent fields are modified frequently which is why the
child content was separated, particularly as the original source for
the child documents is expensive to retrieve.

Documents are replicated across three nodes. No data is stored with
the exception of a unique id for each doc. Each node is allocated 8
GB of RAM and we occupy about 22 GB per node on disk. We use the
routing key to "shard" our data. There are approximately 130
different routing keys in use at the moment. Routing keys are also
used as conditions for all searches so they should be a quick filter.

First, does anyone have a sense of the penalty we're paying for
having this parent/child relationship? We're seeing some very long
query times particularly when we're actively writing to the nodes.
Sometimes a simple query with one condition on the parent and one in a
has_child can take 8+ minutes.

I've noticed that when we're doing a lot of writes to the child
index in particular the times go up significantly. On the other hand
if we only write to the parent index this is much less of a problem.
Is this expected?

Finally, does anyone have any suggestions for tuning this
configuration or improving our queries?

Thanks,
-Tim
On Thursday, 23 February 2012 19:59:44 UTC+1, Tim J wrote:

Hey folks,
I'm currently working with an ES index of roughly 52 million
documents. We index approximately 10-20 new docs per second. Each
document is broken into two pieces and indexed as a parent/child
pair. The child contains static content and is unlikely to ever be
updated. The parent fields are modified frequently which is why the
child content was separated, particularly as the original source for
the child documents is expensive to retrieve.

Documents are replicated across three nodes. No data is stored with
the exception of a unique id for each doc. Each node is allocated 8
GB of RAM and we occupy about 22 GB per node on disk. We use the
routing key to "shard" our data. There are approximately 130
different routing keys in use at the moment. Routing keys are also
used as conditions for all searches so they should be a quick filter.

First, does anyone have a sense of the penalty we're paying for
having this parent/child relationship? We're seeing some very long
query times particularly when we're actively writing to the nodes.
Sometimes a simple query with one condition on the parent and one in a
has_child can take 8+ minutes.

I've noticed that when we're doing a lot of writes to the child
index in particular the times go up significantly. On the other hand
if we only write to the parent index this is much less of a problem.
Is this expected?

Finally, does anyone have any suggestions for tuning this
configuration or improving our queries?

Thanks,
-Tim

On Tuesday, 28 February 2012 14:19:40 UTC+1, Tim J wrote:

Hey haarts,
I'd be really interested to see what you come up with here. So from
the sound of it, until a document turns up in a query it's not added
to the memory map? Can you point me to the area of code where that
memory map is managed?

Thanks,
-Tim

On Feb 28, 6:28 am, haarts harmaa...@gmail.com wrote:

Let me chime in as well.
Our problem is similar. We have about 700M items and add items at a
speed
of 100/s. The performance we are seeing is not great. The query time
required (has_child) is dependent on the amount on new items indexed
(that
makes sense). But that time is already seconds(!) after adding a couple
of
thousand new items. And many minutes if we leave it running for a while.

We are currently investigating whether it is possible to add the ids to
the
internal memory map as soon as they are indexed.

On Thursday, 23 February 2012 19:59:44 UTC+1, Tim J wrote:

Hey folks,
I'm currently working with an ES index of roughly 52 million
documents. We index approximately 10-20 new docs per second. Each
document is broken into two pieces and indexed as a parent/child
pair. The child contains static content and is unlikely to ever be
updated. The parent fields are modified frequently which is why the
child content was separated, particularly as the original source for
the child documents is expensive to retrieve.

Documents are replicated across three nodes. No data is stored with
the exception of a unique id for each doc. Each node is allocated 8
GB of RAM and we occupy about 22 GB per node on disk. We use the
routing key to "shard" our data. There are approximately 130
different routing keys in use at the moment. Routing keys are also
used as conditions for all searches so they should be a quick filter.

First, does anyone have a sense of the penalty we're paying for
having this parent/child relationship? We're seeing some very long
query times particularly when we're actively writing to the nodes.
Sometimes a simple query with one condition on the parent and one in a
has_child can take 8+ minutes.

I've noticed that when we're doing a lot of writes to the child
index in particular the times go up significantly. On the other hand
if we only write to the parent index this is much less of a problem.
Is this expected?

Finally, does anyone have any suggestions for tuning this
configuration or improving our queries?

Thanks,
-Tim

On Tuesday, 28 February 2012 14:19:40 UTC+1, Tim J wrote:

Hey haarts,
I'd be really interested to see what you come up with here. So from
the sound of it, until a document turns up in a query it's not added
to the memory map? Can you point me to the area of code where that
memory map is managed?

Thanks,
-Tim

On Feb 28, 6:28 am, haarts harmaa...@gmail.com wrote:

Let me chime in as well.
Our problem is similar. We have about 700M items and add items at a speed
of 100/s. The performance we are seeing is not great. The query time
required (has_child) is dependent on the amount of new items indexed (that
makes sense). But that time is already seconds(!) after adding a couple of
thousand new items. And many minutes if we leave it running for a while.

We are currently investigating whether it is possible to add the ids to the
internal memory map as soon as they are indexed.

Tim_J · February 29, 2012, 2:11pm

Thanks Harm. I think we may be looking at alternative solutions for
this one as our deadline is fast approaching. I'll be keeping an eye
on this thread though in case you turn up anything good!

-Tim

On Feb 28, 10:59 am, haarts harmaa...@gmail.com wrote:

Hi Tim,

Just an update.
We are encountering some unexpected behaviour when running the same
has_child query twice consecutively. The first query takes about 30 minutes
to complete, as does the second one. This is contrary to the belief that
once the IDs are loaded in memory, searches should be fast. The index contains
36M documents and the index size is 30GB, running on an 8-core i7 with 24GB
RAM.
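A comparison like the one described above (same query, run twice in a row) could be scripted roughly as follows. This is only a sketch: `time_repeated` and `looks_cached` are hypothetical helper names, the `run_query` callable stands in for whatever client call actually issues the has_child search, and the 2x speedup threshold is an arbitrary assumption.

```python
import time
from typing import Callable, List


def time_repeated(run_query: Callable[[], object], repeats: int = 2) -> List[float]:
    """Run the same query `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: issue the has_child search here
        durations.append(time.perf_counter() - start)
    return durations


def looks_cached(durations: List[float], speedup: float = 2.0) -> bool:
    """True if every run after the first was at least `speedup` times faster than the first."""
    first, rest = durations[0], durations[1:]
    return bool(rest) and all(d * speedup <= first for d in rest)
```

If the second run is no faster than the first, as reported here, whatever is loaded during the first run is apparently not being reused by the second.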

Regarding your question on when an ID is loaded to the memory map: I
believe they are all loaded all the time.
I believe the code responsible is in
java/org/elasticsearch/index/cache/id/simple/SimpleIdCache.java
(https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/cache/id/simple/SimpleIdCache.java).

Harm

kimchy · February 29, 2012, 2:11pm

Do you have a replica for the shard? If so, it will also need to be loaded on the replicas. See the other thread about why the cache itself is not really reloaded or managed for each change (or refresh).
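One way to picture the behaviour discussed in this thread is a cache that is filled lazily, at query time, and only for data it has not seen yet. The toy model below assumes the cache is kept per index segment, so newly written segments have to be loaded by the next query that needs them. It is an illustration under that assumption, not the SimpleIdCache code; `ToyIdCache` and `Segment` are made-up names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for an index segment: (segment name, [(child_id, parent_id), ...]).
Segment = Tuple[str, List[Tuple[str, str]]]


class ToyIdCache:
    """Lazily builds a child-id -> parent-id map per segment, at query time."""

    def __init__(self) -> None:
        self._per_segment: Dict[str, Dict[str, str]] = {}
        self.loads = 0  # number of segments loaded so far

    def parents_for(self, segments: List[Segment], child_ids: List[str]) -> Set[str]:
        """Return the parent ids of the given children, loading uncached segments first."""
        for name, pairs in segments:
            if name not in self._per_segment:  # the expensive step in this model
                self._per_segment[name] = dict(pairs)
                self.loads += 1
        wanted = set(child_ids)
        parents: Set[str] = set()
        for name, _ in segments:
            for child, parent in self._per_segment[name].items():
                if child in wanted:
                    parents.add(parent)
        return parents


cache = ToyIdCache()
segments: List[Segment] = [("seg0", [("c1", "p1"), ("c2", "p2")])]

first = cache.parents_for(segments, ["c1"])    # loads seg0
loads_after_first = cache.loads
second = cache.parents_for(segments, ["c2"])   # served from the cache, no load
loads_after_second = cache.loads

segments.append(("seg1", [("c3", "p1")]))      # new child documents are written
third = cache.parents_for(segments, ["c3"])    # only the new segment is loaded
loads_after_third = cache.loads
```

In this model a query pays the loading cost once per unseen segment, and every copy of the shard (primary or replica) would keep its own cache and pay that cost separately.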

Nick_Hoffman · February 29, 2012, 2:42pm

On Wednesday, 29 February 2012 09:11:00 UTC-5, Tim J wrote:

Thanks Harm. I think we may be looking at alternative solutions for
this one as our deadline is fast approaching. I'll be keeping an eye
on this thread though in case you turn up anything good!

-Tim

Are you able to add further restrictions to the has_child query? I would
imagine that the more restrictive you can be about which children match the
has_child query, the faster your search will be.

Tim_J · February 29, 2012, 5:44pm

Unfortunately our query is about as restricted as it can be. The
child documents contain a single field (text content) and a _routing
key. All our queries are only interested in one _routing key so we
use that as a filter. Otherwise, it's just a query against the
content.
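For readers following along, the kind of request described above might be assembled like this. It is a sketch only: the index name `docs`, the child type, and all field names and values are made up, and the body uses the `filtered` / `has_child` query DSL roughly as it looked in Elasticsearch releases of that period (later releases changed both). The `routing` URL parameter limits the search to the shard(s) for that routing key.

```python
import json
from urllib.parse import urlencode


def build_search(routing_key: str, parent_field: str, parent_value: str,
                 child_type: str, child_text: str):
    """Return (path, json_body) for a search restricted to one routing key."""
    body = {
        "query": {
            "filtered": {
                # Match parents that have at least one child whose content matches.
                "query": {
                    "has_child": {
                        "type": child_type,
                        "query": {
                            "query_string": {
                                "default_field": "content",  # hypothetical child field
                                "query": child_text,
                            }
                        },
                    }
                },
                # One condition on the parent document itself.
                "filter": {"term": {parent_field: parent_value}},
            }
        }
    }
    path = "/docs/_search?" + urlencode({"routing": routing_key})
    return path, json.dumps(body)


path, body = build_search("customer-42", "status", "open", "child_doc", "timeout")
```

The resulting `path` and `body` can be sent with any HTTP client.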

That does raise an interesting question though. We often add a
number of filters to the parent document which would significantly
reduce the result set. Would it make any sense to reverse that
parent/child relationship?

Thanks,
-Tim

Serg_Pilipenko · October 21, 2012, 10:25pm

I've created an issue here:
Improve implementation of SimpleIdCache · Issue #2343 · elastic/elasticsearch · GitHub
Please vote.
