I'm using transforms to roll up our access logs into daily aggregates, e.g. "customer A had 500 requests totalling 1000 credits".
Problem
I want to verify that the transformed data adds up, i.e. that the request count matches the source. So I ran the same query against both indices. The results are really close, but not the same:
My intuition
The differences (last column) are so small that I suspect a timing issue, but neither timezones nor transform lag make sense to me as an explanation.
Sources & code
Here are my transform code and the queries for the normal index and the transform index. The two queries are basically identical, except that the one for the normal index needs an additional aggregation to sum the requests (the transform already provides that sum) and the field name differs.
I see that your queries define a timezone, but the transform does not. The transform pre-aggregates with a date_histogram, so the date bucketing has already happened by the time the data lands in the transformed index, and it happened without your timezone. A request close to midnight can therefore end up in a different day in the transformed index than in your query on the normal index. In other words, the precision is already lost after the transform. I think you should either define the timezone in the transform or configure the date_histogram with finer granularity, e.g. 1h. With hourly buckets, your query on the transformed index can still shift the buckets into the right day.
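To make the bucketing effect concrete, here is a minimal Python sketch. It is not Elasticsearch code and does not use your data: the timestamps and the UTC+1 query timezone are made-up assumptions. It only shows that counting per day in UTC and counting per day in another timezone give different daily numbers, and that hourly pre-aggregated buckets can still be regrouped correctly.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical request timestamps in UTC (not taken from the original logs).
events = [
    datetime(2023, 3, 1, 10, 0, tzinfo=timezone.utc),
    datetime(2023, 3, 1, 23, 30, tzinfo=timezone.utc),  # just before midnight UTC
    datetime(2023, 3, 2, 8, 0, tzinfo=timezone.utc),
]

# Assumed timezone used by the query: a fixed UTC+1 offset.
query_tz = timezone(timedelta(hours=1))


def daily_counts(timestamps, tz):
    """Count events per calendar day as seen in the given timezone."""
    return Counter(ts.astimezone(tz).date().isoformat() for ts in timestamps)


# What a daily pre-aggregation without a timezone produces (UTC days).
utc_days = daily_counts(events, timezone.utc)

# What a query with the timezone produces on the raw events.
local_days = daily_counts(events, query_tz)

# Hourly pre-aggregation: truncate each timestamp to the start of its hour.
hourly = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in events)

# Regroup the hourly buckets into days in the query timezone.
rebucketed = Counter()
for hour_start, count in hourly.items():
    rebucketed[hour_start.astimezone(query_tz).date().isoformat()] += count

print("UTC days:       ", dict(utc_days))
print("Query-tz days:  ", dict(local_days))
print("From 1h buckets:", dict(rebucketed))
```

The totals over all days are the same in every case; only the per-day split moves. Regrouping hourly buckets like this works when the timezone offset is a whole number of hours, which is the case in this sketch.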
Another reason for the mismatch might be the terms grouping. Can you verify that the customer field is never null? By default, a transform ignores documents where the grouped field is missing, unless you set missing_bucket to true.
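As a rough illustration of that second point, the following Python sketch simulates a group-by on customer. The documents and the `group_requests` helper are hypothetical, and the `missing_bucket` argument only mimics the setting described above; it is not the transform implementation.

```python
from collections import Counter

# Hypothetical access-log documents; two of them have no usable customer value.
docs = [
    {"customer": "A", "credits": 2},
    {"customer": "A", "credits": 3},
    {"customer": "B", "credits": 1},
    {"customer": None, "credits": 1},
    {"credits": 4},
]


def group_requests(documents, missing_bucket=False):
    """Count requests per customer, skipping missing keys unless missing_bucket is set."""
    counts = Counter()
    for doc in documents:
        key = doc.get("customer")
        if key is None and not missing_bucket:
            continue  # document is silently left out of the grouped result
        counts[key] += 1
    return counts


default_total = sum(group_requests(docs).values())
with_missing_total = sum(group_requests(docs, missing_bucket=True).values())

print("source documents:       ", len(docs))
print("grouped (default):      ", default_total)
print("grouped (missing bucket):", with_missing_total)
```

If the gap between your two indices matches the number of source documents without a customer value, this is the likely cause.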