Has_child query slow due to global ordinals - either at refresh or query time, looking for workaround


I'm migrating from ES 1.1.1 to 2.4.1, and have run into an issue while testing some parent/child queries.

Specifically, some has_child queries were taking much longer to return than on the old cluster (2-3+ seconds vs ~500ms). This was tested via curl, using the returned "took" field as the timing. I was unable to find the queries in the slow query logs, so I ended up dropping the warn threshold to 0ms to log everything. When I finally found the queries, the logs report the same acceptable ~500ms the old cluster delivered, yet the queries still take much longer to actually return via curl.
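For anyone else chasing the same discrepancy, the slowlog thresholds can be lowered with a live settings update; a sketch of what I did (index name is a placeholder):

```shell
# Log every query/fetch phase by setting the warn threshold to 0ms.
# "my_index" is a placeholder; these are per-index dynamic settings.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "0ms",
  "index.search.slowlog.threshold.fetch.warn": "0ms"
}'
```

Note the slowlog times each shard-level phase, not the full request, which is presumably why it disagrees with the "took" value.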

Does anyone have any idea why such a discrepancy would exist? The cluster is being indexed into pretty constantly, but isn't serving any searches yet. Non-has_child queries are returning fast as expected (I haven't tested has_parent yet, as it never performed well enough for me to use in the old cluster). I am using custom routing, for what it's worth.


Checking hot_threads while the query is running shows that the time appears to be spent in ParentChildIndexFieldData.buildOrdinalMap.
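In case it helps anyone reproduce this, the check was just the hot threads API while the slow query was in flight:

```shell
# Capture the busiest threads across the cluster while the slow
# has_child query is running; the buildOrdinalMap frames show up here.
curl -XGET 'localhost:9200/_nodes/hot_threads?threads=3'
```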

I'm guessing that happens prior to the actual search, which is then fast. I had assumed the ordinals would only need to be rebuilt after a merge; am I wrong here? I'm importing more or less constantly, but not THAT much data, and running the same query back-to-back results in the same slowness.

The warnings about performance scared me away from eagerly loading the ordinals, but perhaps that's the way to go. I often need freshly indexed data to be searchable immediately, and the docs seem to say the ordinals are rebuilt on every refresh?
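If I'm reading the 2.x docs right, eager loading for parent/child is configured on the _parent field's fielddata; a sketch of the mapping I tried (type names are placeholders):

```shell
# Create an index whose _parent global ordinals are rebuilt eagerly at
# refresh time instead of lazily on the first query after a refresh.
# "my_index", "parent_type", and "child_type" are placeholder names.
curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "child_type": {
      "_parent": {
        "type": "parent_type",
        "fielddata": {
          "loading": "eager_global_ordinals"
        }
      }
    }
  }
}'
```

This moves the cost from the first search to the refresh, which is exactly the trade-off I describe below.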

Am I missing something here?

Also, just to verify, there's no way to revert back to the old id cache behavior in ES 2.4? I can't disable doc_values for the _parent field.

Thus far the eager_global_ordinals setting, while making the queries fast, slows the refreshes something awful (~10 seconds), and I need newly-indexed data available pretty much in real time. The 10-second refresh wait makes this unworkable.

The only other thing I can think of is to raise the refresh interval (to 10-30 seconds from 1 second), keep eager loading, and manually issue refreshes when real-time behavior is needed. Those refreshes would still be painfully slow whenever they happen, though. It seems like support for this use case, which I was relying upon, has effectively been removed. I'm open to any workaround anyone can suggest.
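That workaround would look something like the following (index name is a placeholder; both settings and refresh are standard APIs):

```shell
# Raise the refresh interval so the eager global ordinal rebuild
# happens far less often ("my_index" is a placeholder):
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'

# ...then force a refresh only when newly indexed data must be
# visible immediately, accepting the rebuild cost at that moment:
curl -XPOST 'localhost:9200/my_index/_refresh'
```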




In this case it makes sense to increase the refresh interval to be equal to or higher than the time it takes to load global ordinals.

What I also suggest is conservatively increasing the number of primary shards, so that each shard builds a smaller global ordinals structure and the shards can load them concurrently. (Add two more primary shards and see if that has the desired effect; if not, add a few more.)
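One caveat: in 2.x the primary shard count is fixed at index creation, so adding shards means creating a new index and reindexing into it. A sketch, assuming the default of 5 primaries plus the two suggested above (name and count are illustrative):

```shell
# Create a replacement index with two extra primary shards;
# "my_index_v2" and the shard count are placeholders.
curl -XPUT 'localhost:9200/my_index_v2' -d '{
  "settings": { "number_of_shards": 7 }
}'
```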

From 2.0 onwards the parent/child feature relies completely on global ordinals. This gave parent/child queries a performance boost, but it does require having global ordinals loaded, which in your case adds ~10s either to the execution of the first query after a refresh or to the refresh itself (depending on whether lazy or eager global ordinals has been configured). This is a trade-off.

Prior to ES 2.0, when a parent/child query matched more than just a few documents, performance could be really terrible (more than 10 seconds). Also, when many parent or child documents matched, the parent/child queries were internally building a global-ordinals-like data structure for just that query execution and throwing it away after the query completed. The global ordinals the queries use now are reused between query executions until a new refresh happens.

Thanks for the response. Your previous posts on the subject have been very helpful as well.

For whatever reason, the previous (1.1) performance was very acceptable for our has_child queries (filters really, for what it's worth). has_parent truly was unacceptable, and I ended up replicating parent data in the child types to avoid issuing those queries at all. I'm not sure if we were matching few enough records that things performed well, if we have an odd parent/child distribution, or how we managed to make it usable, but it was working.

In 2.4, I've ended up keeping the relationship, but avoiding the has_child/has_parent queries unless there are large numbers (for us, ~20K+, though I'm still tuning the threshold) of matching child records. Otherwise I just aggregate on the related ID and use a terms query to limit the related type. This provides good performance in the normal case, while avoiding an ever-growing list of IDs in the abnormal one. Maybe we just don't have enough matching child records on average to truly benefit from the new behavior, but this has gotten me moving forward again.
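Roughly, the two-step replacement looks like this (index, type, and field names are all made up for illustration; we store the parent's ID on the child, since we already replicate parent data there):

```shell
# Step 1: collect the parent IDs of the matching children via a terms
# aggregation ("parent_id" is our own indexed field, not ES internals):
curl -XGET 'localhost:9200/my_index/child_type/_search' -d '{
  "size": 0,
  "query": { "match": { "some_child_field": "value" } },
  "aggs": {
    "parent_ids": { "terms": { "field": "parent_id", "size": 20000 } }
  }
}'

# Step 2: fetch the parents with a terms query on those collected IDs,
# instead of issuing a has_child query at all:
curl -XGET 'localhost:9200/my_index/parent_type/_search' -d '{
  "query": { "terms": { "_id": ["id1", "id2"] } }
}'
```

The ~20K threshold is where the ID list in step 2 gets unwieldy for us and has_child starts winning again.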

I've also considered re-sharding, but it seemed like it would take a lot to bring the refresh number down to an acceptable level from where it was. But that's still an idea I'm considering going forward.

Thanks again for the help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.