Continuous transform doesn't use checkpoint timestamp to filter search

On ES 7.7.1

I recently created a transform with an empty (match_all) query that pivots with a group_by including three terms and one histogram (not a date_histogram which I understand has some issues with transforms).

Watching the logs in the system, I can see that every time the transform checkpoints, it uses a null "from" in the date range part of the query ({"range":{"audit_modified":{"from":null,"to":1600080843075...).

Is this expected behavior?

Can you elaborate on the date_histogram issues you mention? Maybe it's just a problem with your configuration, or maybe it has been fixed in the meantime.

Regarding the query: yes, that's correct. Transform has to re-query all data up to the checkpoint, but if you look at the other parts of the query, you should see more filters. For example, transform narrows the query so that only certain terms are updated. Say only users A, C and E, but not B and D, have changed something on their profile. When transform runs, it first queries for changes (you should see this query, too) and then recomputes the pivot for A, C and E, but not for B and D.
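The two steps described above can be sketched in plain Python. This is only an in-memory illustration, not the actual transform implementation: the field names "user" and "amount", the helper functions, and the exact boundary handling are assumptions made for the example; only "audit_modified" comes from the logged query.

```python
from collections import defaultdict


def changed_entities(docs, last_checkpoint, new_checkpoint):
    """Step 1: find group_by keys that have documents modified since the last checkpoint."""
    return {
        d["user"]
        for d in docs
        if last_checkpoint < d["audit_modified"] <= new_checkpoint
    }


def recompute_pivot(docs, entities, new_checkpoint):
    """Step 2: recompute the aggregation for the changed keys only.

    Note there is no lower time bound here (the null "from"): the sum needs
    every document of a changed key, not just the new ones.
    """
    totals = defaultdict(int)
    for d in docs:
        if d["user"] in entities and d["audit_modified"] <= new_checkpoint:
            totals[d["user"]] += d["amount"]
    return dict(totals)


# Hypothetical source documents; timestamps are arbitrary integers.
docs = [
    {"user": "A", "amount": 5, "audit_modified": 100},
    {"user": "B", "amount": 7, "audit_modified": 110},
    {"user": "A", "amount": 3, "audit_modified": 250},
    {"user": "C", "amount": 4, "audit_modified": 260},
]

changed = changed_entities(docs, last_checkpoint=200, new_checkpoint=300)
updated = recompute_pivot(docs, changed, new_checkpoint=300)
```

Here only A and C changed after the last checkpoint, so B is never touched, while A's total still includes its older document from before the checkpoint.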

You might now ask: why doesn't it just take the new values and update the doc?

To explain the challenge: Transform is a generic tool and supports a lot of different aggregations. To illustrate the "Update instead of Re-compute" problem:

  • min/max/sum are easy
  • for average we could store sum and count to make it update-able
  • to update a median you need a histogram; fortunately we have that now (histogram datatype)
  • for cardinality we have to store the sketch, e.g. the HyperLogLog data structure; we do not have such a data type yet
  • for scripted metric we need the user to write the update method
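To make the list above concrete, here is a small Python sketch of which aggregations can be updated from stored state and which cannot. It is an illustration only; the AvgState class and the sample values are made up for this example and do not reflect how transform stores anything.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AvgState:
    """State that makes an average update-able: keep sum and count, not the average."""
    total: float = 0.0
    count: int = 0

    def update(self, new_values):
        for v in new_values:
            self.total += v
            self.count += 1

    @property
    def value(self):
        return self.total / self.count if self.count else None


old = [1, 2, 3]
new = [10, 20]

# max: the stored result alone is enough to fold in new values.
incremental_max = max(max(old), max(new))

# average: update-able once sum and count are stored.
avg_state = AvgState()
avg_state.update(old)
avg_state.update(new)
incremental_avg = avg_state.value

# median: the stored result is NOT enough. Two data sets with the same
# median give different results after the same new values arrive.
other_old = [2, 2, 2]
same_median_before = median(old) == median(other_old)
median_a = median(old + new)
median_b = median(other_old + new)
```

Both old data sets have a median of 2, yet after adding the same new values one ends at 3 and the other stays at 2, which is why a richer structure such as a histogram has to be stored to update a median.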

This doesn't mean we do not want to support updates at all; we plan to implement this in the future. Because of the challenges explained above, we will likely not make every aggregation/data type update-able, or we will at least add support step by step over time.

I had seen a post (Continous transforms accumulating delay, can be tweak for speed up?) in which date_histogram was mentioned as potentially affecting how the date is used in the transform searches.

You're right, I do see in the searches a set of other terms taken into account, and that set seems to vary with the updated documents.

Thanks for the explanation; that makes sense. I just wanted to be sure I hadn't done something wrong.

The issue with date_histogram applies if (and only if) you use different fields for sync and group_by. If they are the same field, it's ok.

Or, to state it the other way around: if the same field is used for sync and group_by, transform applies an optimization; otherwise it does not. An optimization that works for different fields is planned; there is no ETA, but it is high on the list.
