Let me chime in as well.
Our problem is similar. We have about 700M items and add new ones at a rate
of 100/s. The performance we are seeing is not great. The has_child query
time depends on the number of newly indexed items (that makes sense). But
that time is already seconds(!) after adding a couple of thousand new items,
and many minutes if we leave it running for a while.
We are currently investigating whether it is possible to add the ids to the
internal memory map as soon as they are indexed.
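To make the idea concrete, here is a toy sketch of what "adding the ids as soon as they are indexed" could mean. It is not Elasticsearch code and does not reflect its internals; the class ParentIdMap and its methods are hypothetical names used only for illustration. The point is that each newly indexed child updates the map in constant time, so a later lookup does not have to rebuild anything.

```python
class ParentIdMap:
    """Toy child -> parent id map, updated incrementally (hypothetical, not ES internals)."""

    def __init__(self):
        self._parent_of = {}  # child id -> parent id

    def add(self, child_id, parent_id):
        # Called at index time for every new child document: O(1) per document.
        self._parent_of[child_id] = parent_id

    def parents_of(self, matching_child_ids):
        # Resolve the children matched by the inner query to their parent ids.
        # Children that are not in the map yet are simply skipped.
        return {
            self._parent_of[child_id]
            for child_id in matching_child_ids
            if child_id in self._parent_of
        }


id_map = ParentIdMap()
id_map.add("child-1", "parent-a")
id_map.add("child-2", "parent-a")
id_map.add("child-3", "parent-b")
print(sorted(id_map.parents_of(["child-1", "child-3", "child-unknown"])))
```

Whether the real id cache can be updated this way depends on how it is tied to index segments, which is the part we are still investigating.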
On Thursday, 23 February 2012 19:59:44 UTC+1, Tim J wrote:
Hey folks,
I'm currently working with an ES index of roughly 52 million
documents. We index approximately 10-20 new docs per second. Each
document is broken into two pieces and indexed as a parent/child
pair. The child contains static content and is unlikely to ever be
updated. The parent fields are modified frequently which is why the
child content was separated, particularly as the original source for
the child documents is expensive to retrieve.
Documents are replicated across three nodes. No data is stored with
the exception of a unique id for each doc. Each node is allocated 8
GB of RAM and we occupy about 22 GB per node on disk. We use the
routing key to "shard" our data. There are approximately 130
different routing keys in use at the moment. Routing keys are also
used as conditions for all searches so they should be a quick filter.
First, does anyone have a sense of the penalty we're paying for
having this parent/child relationship? We're seeing some very long
query times particularly when we're actively writing to the nodes.
Sometimes a simple query with one condition on the parent and one in a
has_child can take 8+ minutes.
I've noticed that when we're doing a lot of writes to the child
index in particular the times go up significantly. On the other hand
if we only write to the parent index this is much less of a problem.
Is this expected?
Finally, does anyone have any suggestions for tuning this
configuration or improving our queries?
Thanks,
-Tim
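
For readers following along, the kind of request described above (one condition on the parent, one inside has_child, restricted by a routing key) might look roughly like the sketch below. All field names, type names and values are hypothetical, since the thread does not give the actual mapping. The body uses the filtered query available in the Elasticsearch releases of that period (later versions use a bool query instead), and the routing key is passed as the routing URL parameter of the search request. The code only builds the request; it does not contact a cluster.

```python
import json


def build_has_child_search(routing_key, parent_field, parent_value,
                           child_type, child_field, child_value):
    """Build URL parameters and a body for a parent search filtered by has_child."""
    body = {
        "query": {
            "filtered": {
                # Condition on the parent document itself.
                "query": {"term": {parent_field: parent_value}},
                # Keep only parents that have at least one matching child.
                "filter": {
                    "has_child": {
                        "type": child_type,
                        "query": {"term": {child_field: child_value}},
                    }
                },
            }
        }
    }
    # The routing key limits the search to the shard holding that key's documents.
    params = {"routing": routing_key}
    return params, body


# Hypothetical names: "status", "content" and "body" are not from the thread.
params, body = build_has_child_search(
    routing_key="customer-42",
    parent_field="status", parent_value="active",
    child_type="content",
    child_field="body", child_value="elasticsearch",
)
print(params)
print(json.dumps(body, indent=2))
```

Sending this body to the parent type's _search endpoint with the routing parameter set would be the usual way to run it.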