About performance using has_child query for filtering

Hi guys,

I have to filter out users who have already interacted with each other, so they don't appear in search anymore.

I'm planning to do this filtering using a has_child query. The main user document will be the parent, and a child document will be created for each interaction.
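To make the plan concrete, here is a minimal sketch of what such a search body could look like, written as a plain Python dict. The relation name `interaction`, the field `other_user_id` and the `country` filter are made-up names for illustration, not part of any existing mapping.

```python
def build_search_body(current_user_id, other_filters):
    """Build a search body that hides users who already have an
    'interaction' child document pointing at current_user_id."""
    return {
        "query": {
            "bool": {
                # The regular search filters go here.
                "filter": list(other_filters),
                "must_not": [
                    {
                        "has_child": {
                            "type": "interaction",  # hypothetical child relation name
                            "query": {
                                # hypothetical field stored on each child document
                                "term": {"other_user_id": current_user_id}
                            },
                        }
                    }
                ],
            }
        }
    }


body = build_search_body("user-42", [{"term": {"country": "ES"}}])
```

The has_child clause sits under must_not, so a parent user is excluded as soon as one matching child exists.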

I'm a bit concerned about these points:

  • Sparse data: The child documents will have just a few fields compared to the parent ones. I will disable doc_values for the fields I don't intend to sort or aggregate on, but still.

  • The number of documents will be larger in shards with heavy users: a parent document and all of its children must live on the same shard.

  • Performance of the has_child query itself
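As a rough illustration of the first two points, the sketch below builds a mapping and a child document as Python dicts. It assumes an Elasticsearch version that has the `join` field type and typeless mappings; the names `relation`, `user`, `interaction` and `other_user_id` are hypothetical, and the `routing`/`document` wrapper is just a container for this example, not an Elasticsearch structure.

```python
def build_mapping():
    """Mapping with a join field and doc_values disabled on a field
    that is only ever filtered on (never sorted or aggregated)."""
    return {
        "mappings": {
            "properties": {
                # One parent relation ("user") with one child relation ("interaction").
                "relation": {"type": "join", "relations": {"user": "interaction"}},
                # Hypothetical child field; doc_values off to save disk space.
                "other_user_id": {"type": "keyword", "doc_values": False},
            }
        }
    }


def build_child_request(parent_user_id, other_user_id):
    """Describe one interaction child document.

    The routing value must be the parent id so that the child lands on
    the same shard as its parent.
    """
    return {
        "routing": parent_user_id,
        "document": {
            "other_user_id": other_user_id,
            "relation": {"name": "interaction", "parent": parent_user_id},
        },
    }


request = build_child_request("user-1", "user-42")
```

Because every child is routed by its parent id, a heavy user's shard receives all of that user's interaction documents, which is where the shard imbalance comes from.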

Thoughts?

This is a tough problem. I guess you have two options:

  1. use parent/child like you describe

  2. add an interacted_with field to all your documents and update your documents to append to this list whenever there are interactions

Option 1 will be slow and makes sharding complicated, while option 2 will become problematic if you start having millions of entries in the interacted_with field.

Maybe it's best to make conscious trade-offs. For instance, you could go with option 2 and make the interacted_with field a rolling buffer: whenever it reaches its maximum size (e.g. 10000), instead of just appending to it, you would also remove the first value, which is the oldest of the 10000 users that the current user interacted with. It is less correct, but it helps keep the problem bounded and might be good enough for your users.
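The rolling buffer could look roughly like this on the client side. This is only a sketch: the function name `append_bounded` and the idea of computing the new list in application code before sending the update are assumptions, not something prescribed by Elasticsearch.

```python
def append_bounded(interacted_with, user_id, max_size=10000):
    """Return a new list with user_id appended, dropping the oldest
    entries so that at most max_size values are kept."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    updated = list(interacted_with)
    updated.append(user_id)
    overflow = len(updated) - max_size
    # Entries are kept in insertion order, so the oldest are at the front.
    return updated[overflow:] if overflow > 0 else updated


buffer = append_bounded(["a", "b", "c"], "d", max_size=3)
```

With a maximum size of 3, adding "d" to ["a", "b", "c"] drops "a", the oldest interaction.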

Thanks for your reply!

Actually we're currently using the interacted_with method, but we found that the update frequency of the interacted_with field is very high, so each user document is recreated many times in a short period, causing:

  • Creation of a lot of garbage (many segment merges, etc.)
  • During peak time there are many conflicts, making it difficult to keep the interacted_with field updated correctly
  • For heavy users the interacted_with field is quite big (making it a rolling buffer as you suggested could be good)

With parent/child, a new document will be created each time, so we won't need to update the user documents, which solves the above problems. But I wonder what new problems we might face.

When you say slow, what do you refer to? The has_child query itself?

Regards!

This is exactly the trade-off of parent/child: it will help indexing, but will make querying much slower. Parent/child queries are, along with script queries, our only queries that might have to resort to a full scan in order to identify matches. Depending on the index size, this may be significantly slower than your current approach. Also, like you already noted, it makes sharding more complicated.

Maybe there are ways that you could work around it on the client-side. For instance by keeping the list of recent interactions in a cache and applying them manually (by having one must_not clause on the interacted_with field and another one on an ids query whose ids would be retrieved from this cache). Then you can periodically move the list of recent interactions from your cache to Elasticsearch and clear your cache. This should help decrease the update rate?
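A minimal sketch of the query side of this workaround, as a Python dict builder. It assumes that each candidate's interacted_with field contains the ids of the users it interacted with, and that the cache holds the ids of users the searcher interacted with recently; the function and parameter names are hypothetical.

```python
def build_query(current_user_id, cached_recent_ids, other_filters):
    """Exclude users already recorded in Elasticsearch (interacted_with)
    and users only recorded in the client-side cache so far (ids query)."""
    must_not = [
        # Interactions that were already flushed to Elasticsearch.
        {"term": {"interacted_with": current_user_id}},
    ]
    if cached_recent_ids:
        # Recent interactions that still live only in the cache.
        must_not.append({"ids": {"values": sorted(cached_recent_ids)}})
    return {
        "query": {
            "bool": {
                "filter": list(other_filters),
                "must_not": must_not,
            }
        }
    }


query = build_query("user-42", {"user-7", "user-3"}, [])
```

Once the cached interactions have been written to the interacted_with field in bulk, the cache can be cleared and the ids clause disappears from the next query.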

The has_child queries will never be executed alone, but always together with other filters. Would that avoid full scans?

Thanks for the suggestion.
We already contemplated doing some workaround, but due to how our system currently works, implementing something like that would increase the complexity a lot. So we were looking for a way to do it just within Elasticsearch, but all the ES-only solutions have trade-offs, I guess.
