Transform missing data

rokcarl · July 4, 2022, 1:56pm

I have a transform that aggregates my stats index data for long-term storage, but the results are no longer correct.

My stats index stores each request (one document = one request) along with the credits used by that request, e.g. 6. I transform this into one-day aggregations, e.g. customer A made 1200 requests on Monday using a total of 2400 credits. The aggregations for the transform are:

"credits": { "sum": { "field": "credits" } },
"requests": { "value_count": { "field": "timestamp" } }

I run this with a frequency of 1h and this sync: {"time": {"field": "timestamp","delay": "90m"}} and I group by a 1d fixed interval and by customer. Here's the entire transform json, you can ignore the scripted fields.

When I then graph the count of records and the sum of credits, neither graph is the same between a normal index and the transform. Why is that?

left is real graph, right is transform.

Do I need to run the transform at the end of the day? As far as I'm aware that's not needed?

Hendrik_Muhs · July 5, 2022, 8:02am

You specified a query filter that limits the transform to data from the last 3 hours, and effectively to just 1.5 hours because of the 1.5 hour delay:

"must": [{
    "range": {"timestamp": {"gte": "now-3h"}}
}]

Such a filter is never a good idea. The transform takes care of time-based filtering itself; you can find an explanation of checkpointing in the docs. A range query like this is only useful if you want to set a fixed start date, e.g. to prevent the transform from processing all data from the past.
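To illustrate the fixed start date case, here is a minimal sketch of a transform `source` section, written as a Python dict so it can be dumped to JSON. The index name `stats` and the start date are placeholders, not values from the original transform.

```python
import json

# Hypothetical values: the index name and the start date are placeholders.
# The point is that the lower bound is an absolute date, not a relative
# expression like "now-3h" that moves with every checkpoint.
source = {
    "index": "stats",
    "query": {
        "bool": {
            "must": [
                {"range": {"timestamp": {"gte": "2022-01-01T00:00:00Z"}}}
            ]
        }
    },
}

# Print the JSON that would go into the "source" part of the transform config.
print(json.dumps(source, indent=2))
```

With an absolute lower bound, documents that were visible at one checkpoint remain visible at every later one, so the filter does not interfere with checkpointing.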

Which version are you using? If it is relatively recent, I suggest reading about settings.align_checkpoints. This setting defaults to true, which causes the transform to wait until a bucket is complete before it creates it. In your case of 1 day buckets, that means it waits until midnight plus your 1.5 hour delay, so with a 1h frequency it creates new data roughly between 1:30 and 2:30 in the morning. If you prefer to see intermediate results during the day, you can set align_checkpoints to false. In this case the transform updates the results during the day. As this means more work, it comes at the cost of performance.
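For reference, a minimal sketch of the request body that would turn the setting off, again as a Python dict. It assumes the body is sent to the transform update endpoint (`POST _transform/<transform_id>/_update`); the transform id is left as a placeholder because it is not given in this thread.

```python
import json

# Body for the transform update API. Only the setting discussed above is
# included; everything else in the transform config stays as it is.
update_body = {
    "settings": {
        # false = write intermediate results for the current (incomplete) bucket
        "align_checkpoints": False
    }
}

# json.dumps converts Python's False into JSON's false.
print(json.dumps(update_body))
```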

TL;DR
If you use a relatively recent version and align_checkpoints is set to true, your query from above causes data problems, because the transform can only see at most 1.5 hours of data from the previous day.
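The 1.5 hour figure can be checked with a little date arithmetic. The sketch below is a simplified model, not the transform's actual implementation: it assumes the transform processes the last complete 1d bucket at `run_time`, that the bucket counts as complete once `run_time - delay` has passed midnight, and that the `now-3h` filter is evaluated at `run_time`. The function name `visible_span` is made up for this example.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=90)       # sync.time.delay
FILTER_WINDOW = timedelta(hours=3)  # the "now-3h" range filter
BUCKET = timedelta(days=1)          # 1d fixed interval

def visible_span(run_time: datetime) -> timedelta:
    """How much of the last complete 1d bucket the now-3h filter still matches."""
    # Upper bound of the last complete bucket: (run_time - delay) floored to midnight.
    upper = (run_time - DELAY).replace(hour=0, minute=0, second=0, microsecond=0)
    lower = upper - BUCKET
    # The filter only lets through documents newer than run_time - 3h.
    filter_start = run_time - FILTER_WINDOW
    start = max(lower, filter_start)
    return max(upper - start, timedelta(0))

# Earliest possible run after midnight + delay: 1.5 hours of the day are visible.
print(visible_span(datetime(2022, 7, 5, 1, 30)))
# One hour later, only 30 minutes are left.
print(visible_span(datetime(2022, 7, 5, 2, 30)))
```

Under these assumptions, the daily bucket is built from 90 minutes of data in the best case, and from less the later the checkpoint runs.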

Regarding your unusually large query delay: what's the reason for configuring a 1.5 hour delay? As an alternative, you could use an ingest timestamp instead.
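A minimal sketch of the ingest timestamp approach, as Python dicts: an ingest pipeline with a `set` processor that stamps each document with `_ingest.timestamp`, and a `sync` section that uses that field. The field name `event.ingested` and the `60s` delay are assumptions chosen for illustration; pick values that match your own ingest path.

```python
import json

# Ingest pipeline body: stamp every document with the time it was ingested.
# The target field name "event.ingested" is an assumption for this example.
pipeline = {
    "description": "Add an ingest timestamp to each document",
    "processors": [
        {"set": {"field": "event.ingested", "value": "{{{_ingest.timestamp}}}"}}
    ],
}

# Transform sync section: checkpoint on the ingest time instead of the event
# time, so late-arriving documents are still picked up. The 60s delay is an
# assumed value, not taken from the original transform.
sync = {"time": {"field": "event.ingested", "delay": "60s"}}

print(json.dumps(pipeline, indent=2))
print(json.dumps(sync))
```

The date_histogram in the pivot can keep grouping by `timestamp`; only the sync field changes.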

rokcarl · July 14, 2022, 2:12pm

Thank you, that was it and it fixed my problem.

I'm now facing a new problem with a data mismatch, but I opened a new thread for that.

system · August 11, 2022, 2:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
