I have a transform that aggregates my stats index data for long-term storage, but the results are no longer correct.
My stats index stores each request (one document = one request) along with the credits used by that request, e.g. 6. I transform this into one-day aggregations, e.g. customer A made 1200 requests on Monday using a total of 2400 credits. The aggregations for the transform are:
I run this with a frequency of 1h and this sync: {"time": {"field": "timestamp","delay": "90m"}} and I group by a 1d fixed interval and by customer. Here's the entire transform json, you can ignore the scripted fields.
When I then graph the count of records and the sum of credits, neither graph is the same between a normal index and the transform. Why is that?
You specified a query filter that limits the transform to data from the last 3 hours, effectively just 1.5 hours because of the 1.5 hour delay:
Such a filter is never a good idea. The transform applies the time-range filters it needs by itself; you can find an explanation of checkpointing in the docs. Such a query is only useful if you want to set a fixed start date, e.g. to prevent the transform from processing all data from the past.
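To illustrate the one case where a source query makes sense, here is a minimal sketch of a transform `source` section with a fixed start date, built as a Python dict and printed as JSON. The index name `stats` and the date are hypothetical placeholders; only the `timestamp` field comes from the question.

```python
import json

# Hypothetical source section: a fixed lower bound instead of a rolling "last N hours" window.
source = {
    "index": "stats",  # assumed index name, replace with your own
    "query": {
        "range": {
            # Fixed start date (assumed value): older documents are never transformed.
            "timestamp": {"gte": "2023-01-01T00:00:00Z"}
        }
    },
}

print(json.dumps(source, indent=2))
```

Because the lower bound never moves, every later checkpoint still sees the complete data of each bucket.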
Which version are you using? If it is relatively recent, I suggest reading about settings.align_checkpoints. This setting defaults to true, which causes the transform to wait until a bucket is complete before creating it. With your 1 day buckets, that means it waits until midnight plus your 1.5 hour delay, so it creates new data roughly between 1:30 and 2:30 in the morning. If you prefer to see intermediate results during the day, you can set align_checkpoints to false. In this case the transform will update the results during the day. As this means more work, it comes at the price of performance.
TL;DR
If you use a relatively recent version and align_checkpoints is set to true, your query from above causes data problems, because the transform can see at most 1.5 hours of the previous day's data.
Regarding your unusually large query delay: what is the reason for configuring a 1.5 hour delay? As an alternative, you could use an ingest timestamp instead.
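The timing can be sketched with a little date arithmetic. This is a simplified model, not the transform's actual scheduler: it assumes a bucket is written by the first hourly check that happens after bucket end plus delay, and the function name and example date are made up for illustration.

```python
from datetime import datetime, timedelta

def first_aligned_run_window(bucket_start, bucket_len, delay, frequency):
    """Return (earliest, latest) time at which a completed bucket can first be written.

    Simplified model: the bucket is complete at bucket_start + bucket_len, the
    transform only looks at data older than `delay`, and it checks once per
    `frequency` at an unknown phase.
    """
    bucket_end = bucket_start + bucket_len
    earliest = bucket_end + delay
    latest = earliest + frequency
    return earliest, latest

monday = datetime(2024, 1, 1)  # arbitrary example day
earliest, latest = first_aligned_run_window(
    monday,
    bucket_len=timedelta(days=1),
    delay=timedelta(minutes=90),
    frequency=timedelta(hours=1),
)
print(earliest, latest)
```

With the values from the question (1d buckets, 90m delay, 1h frequency) the window lands on the following morning between 01:30 and 02:30.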
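The effect can be checked with a rough sketch. It assumes the filter is a rolling 3 hour lookback relative to the time the checkpoint runs, and that an aligned checkpoint only reads up to the end of the completed bucket; the function name and dates are hypothetical.

```python
from datetime import datetime, timedelta

def visible_hours(run_time, bucket_start, bucket_end, lookback):
    """Hours of the bucket that a rolling lookback filter still lets through."""
    lower = max(bucket_start, run_time - lookback)  # filter cuts off older data
    upper = min(bucket_end, run_time)               # aligned checkpoint stops at bucket end
    return max((upper - lower).total_seconds(), 0) / 3600

day_start = datetime(2024, 1, 1)           # arbitrary example day
day_end = day_start + timedelta(days=1)
lookback = timedelta(hours=3)              # assumed rolling filter from the question

# The bucket is written somewhere between 01:30 and 02:30 the next morning.
best_case = visible_hours(day_end + timedelta(minutes=90), day_start, day_end, lookback)
worst_case = visible_hours(day_end + timedelta(minutes=150), day_start, day_end, lookback)
print(best_case, worst_case)
```

Under these assumptions the daily bucket is built from only the last 0.5 to 1.5 hours of the day, which would explain why both the document count and the credit sum come out too low.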