Has_child / has_parent for billions of children - heavy cpu load for simple queries?

Hey everyone,

is there a trick to speed up has_child / has_parent joins for small subsets of large document bodies? We have around 90 million parent documents and close to 1.4 billion child documents, seven 10-core machines with 64GB ram and three emergency-scrambled fast-clocked 4 core machines with 64GB ram.

We already had huge memory problems with random_sort via function query, eating >60GB of heap per query, which forced us to pre-calculate a random_order to avoid that. Now we found the next pitfall: using has_child/has_parent for small subsets of children/parent documents guzzles cpu time... and the response times grow from 9-15ms to 400-1500ms as soon as we touch parents/childs (depending on load).

We already use eager global ordinals, don't score the childs/parents and now I'm left wondering what we could do, except, of course, throwing more servers at the problem. Maybe like-data saving is a extremely bad use case for Elasticsearch? (I don't want to venture back to sql-land... :sweat_smile: )

Any feedback highly welcome and thanks in advance! :slight_smile:

Oh, and here's one of the offending queries:

/instagram-user/instagram-like/_search
{
   "query": {
      "bool": {
     "filter": [
        {
           "term": {
              "instagram_post_id": 1312932664928280000
           }
        },
        {
           "has_parent": {
              "parent_type": "instagram-user",
              "score_mode": "none",
              "query": {
                 "bool": {
                    "must_not": [
                       {
                          "exists": {
                             "field": "calculated"
                          }
                       }
                    ]
                 }
              }
           }
        }
     ]
  }
   },
   "size": 0
}

Parent/child queries are slow by design. It runs a first phase that consists in identifying the matching join values, and then a second phase that consists in matching the documents that have these join values. However this second phase usually runs a linear scan in order to find matches, which is slow as it runs in linear time with the number of documents in the joined type.

You may try to avoid the parent/child paradigm by inserting as much data as possible from the child documents into the parent ones.

One thing you may try which may or may not help is to increase the number of shards (primary) in your index. That way, each shard would have a smaller set of join values to iterate through.

This was true for ES < 2.0. New indices created on 2.0 and onwards the second phase is linear to the amount of parent docs that match with the rest of the query. In your case that are documents that match with the term query on the instagram_post_id field. How many documents do match this term query? If that number is small then I do expect reasonable performance.

The documents matching count depends on the size of the users and the age of the post, ranging from the hundreds up to a few hundred thousand.

We figured that, in our case, we shouldn't join children into parents, since this would mean joining all child documents for uncalculated users (~50 million parents) before applying the child-query filters - or does ES apply has_child-query filters before joining the children?

If that's not the case we could try to increase the number of primary shards or start thinking about denormalizing the bool flag into the children (resulting in updating millions of documents after doing a simple calculation on the parent) or moving an aggregated liked-users into the parents (resulting in steady stream of reindexes)... decisions, decisions, yay :slight_smile:

In the first phase, the filters / queries inside the has_child/has_parent query are basically used to select what child document match. So these filters are applied, it is just that this isn't the most expensive part of the has_child/has_parent query. It is the second phase that connects the child hits back to their parent hits. This is linear, all parents need to be checked if one of its child documents has been matched in the first phase (or the other way around with has_parent). However if the has_child/has_parent query are part of bigger query than not all parent queries need to be checked for whether they have child matches. In the query you shared here only child documents that match with the term query will be checked if their associated parent document were matched during the first phase of the has_parent query. For this to work this way you should be running this query on an index created on or after ES 2.0, otherwise all child document will be checked for whether they matched. Are you on a ES 2.x version? Did you upgrade from 1.x? In that case you need to reindex the indices created before the upgrade in order for p/c to execute this way.

If you want super fast searches then parent/child design isn't meant to be used here. De-normalizing the data is the only way. So, yes, decisions :slight_smile: