Continuous transformation averaging

Hi,
We collect device behavior/events during the day in a timeline index. Devices are turned on and off by customers, events are sent as they happen, and each event includes an event name and the time consumed for that event.
We need a daily aggregated report grouped by device and event, averaging and summing the time field over the last 7 days.

What is the best way to achieve this, considering this query will be run frequently for each device through an API? Is it better to run a transform and move the data into a new index for fast access, or just run the query directly on the logs index?

PS: we tried the transform feature but don't know how to make it reuse or recalculate the same data again. The transform does not recalculate data that has already been calculated once.

Thank you

This can be achieved by configuring the transform in continuous mode. A continuous transform recalculates buckets as new data comes in. You can find more details in the documentation.

You might stumble upon a problem, though: in your use case the timestamp of the event does not seem reliable. With the default settings of a continuous transform, new data is expected to arrive no later than 60s after the event happened. The setting is called delay. You can adjust delay and set it to a higher value; this compensates for the so-called ingest delay. However, increasing delay has a price attached: the transform will not query new data that is younger than delay, so dashboards/charts lag behind as well. That's why this is not a good solution for your case.
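Just for illustration, delay sits in the sync section of the transform configuration. Compensating for a day of ingest delay on the device timestamp would mean an excerpt like the one below, which also means the transform never looks at the most recent ~25 hours of data:

"sync": {
  "time": {
    "field": "timestamp",
    "delay": "25h"
  }
}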

My suggestion is to add another timestamp on ingest, when the data comes in:

{
    "timestamp": "2022-03-01T02:35:53+00:00",
    "ingest_timestamp": "2022-03-02T07:35:53+00:00",
    ... # your other fields
}

Assume the current time is 2022-03-02T07:36:13+00:00: a continuous transform on the field timestamp would require a delay setting of more than 24h to pick up this data, but if configured on ingest_timestamp the default of 60s works well. Because the ingest timestamp is set when the data comes in, it is very reliable, and you can probably even decrease the default, as it won't take 60s to process and index the document. A clock skew or a wrongly configured clock is also less likely on your server than on the customer device, which is another reason to use an ingest timestamp. A fully detailed description of how this works can be found in the docs.

In your requirements you said you want daily aggregations. This can be done with a pivot transform using a date_histogram group_by, configured on the timestamp field from the device. In other words, the field used for continuous operation (part of sync) does not have to be the same as the one used for the date_histogram: you can use ingest_timestamp for sync and timestamp for the date_histogram (see this PR for details).
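To make this more concrete, here is a rough sketch of such a pivot transform. The index names (device-events, device-events-daily) and the field names (device, event, time_consumed) are assumptions, so adjust them to your mapping; device and event are assumed to be keyword fields:

PUT _transform/device_events_daily
{
  "source": { "index": "device-events" },
  "dest": { "index": "device-events-daily" },
  "frequency": "5m",
  "sync": {
    "time": {
      "field": "ingest_timestamp",   // continuous operation checks for new data here
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "device": { "terms": { "field": "device" } },
      "event": { "terms": { "field": "event" } },
      "day": {
        "date_histogram": {
          "field": "timestamp",      // the device timestamp drives the daily buckets
          "calendar_interval": "1d"
        }
      }
    },
    "aggregations": {
      "time.avg": { "avg": { "field": "time_consumed" } },
      "time.sum": { "sum": { "field": "time_consumed" } },
      "event.count": { "value_count": { "field": "event" } }
    }
  }
}

POST _transform/device_events_daily/_start

Note how sync uses ingest_timestamp while the date_histogram buckets on the device timestamp, which is exactly the split described above.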

Information about adding a timestamp on ingest can be found here.
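In case it helps, a minimal sketch of such a pipeline (the pipeline and index names are made up); _ingest.timestamp is the time the node received the document:

PUT _ingest/pipeline/add-ingest-timestamp
{
  "description": "set ingest_timestamp to the time the document was received",
  "processors": [
    { "set": { "field": "ingest_timestamp", "value": "{{_ingest.timestamp}}" } }
  ]
}

PUT device-events/_settings
{
  "index.default_pipeline": "add-ingest-timestamp"
}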

This is a matter of performance. You don't have to use a transform; you can run the aggregations at query time as well. If you query the logs index a lot, there is a break-even point beyond which the transform solution performs better; the transform destination index basically acts as a cache in this case. Note that runtime queries use caches, too. To get better cache performance, try to tweak your queries to hit the cache more often; in your case rounded dates will help. Have a look at the performance guide.
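For example, a query that rounds its time range to full days stays identical for a whole day, so Elasticsearch can reuse the cached filter instead of evaluating a new range on every call (field and index names are again assumptions):

POST device-events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "device": "device-123" } },
        { "range": { "timestamp": { "gte": "now-7d/d", "lt": "now/d" } } }
      ]
    }
  },
  "aggs": {
    "events": {
      "terms": { "field": "event" },
      "aggs": {
        "time_avg": { "avg": { "field": "time_consumed" } },
        "time_sum": { "sum": { "field": "time_consumed" } }
      }
    }
  }
}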

Hi Hendrik,
Regarding the transform, I just wanted to add something I did not explain correctly:

On the destination index I don't need any timing or date-time filter. All I need is to run the transformation at least once a day (the more often the better) and calculate the information for the past 7 days. So let's assume the transformation runs once a day: on the very first run it will correctly calculate the past 7 days, but on the second run it should recalculate the last 6 days plus the current day and update the values in the destination index, or create them if missing. Currently the transformation does not reconsider the past 6 days.

So at any time the number of rows in the destination index should be: number of devices x number of events. There is no need for additional timing fields, as I already know the values always represent the last 7 days. So the destination index should have only one row per device per event.

If this is not possible without including the fields you recommended, then I will give it a try.

Thank you

Thanks for the follow-up explanation.

What you describe is a scheduled re-run of a batch transform. I think there is little benefit in trying to run this as a continuous transform as you describe; you could instead re-create and re-run the transform as a batch transform every day. The next release will make this easier, as it will provide a reset API. Currently you have to delete the transform and its destination index, re-create the transform and start it yourself; with reset this can be reduced to 2 API calls. We will improve the use case of re-running a batch transform in future releases.
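For reference, the manual sequence looks roughly like this, using the (assumed) names from the sketch above; a batch transform stops on its own once it has finished, so it can be deleted directly:

DELETE _transform/device_events_daily
DELETE device-events-daily

// re-create the transform with the same body as before, just without the
// "sync" section so it runs as a batch transform, then start it again
PUT _transform/device_events_daily
{ ... }

POST _transform/device_events_daily/_start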

However, I have a better suggestion: the destination of a transform is an index and works like any other index. You can query that index and aggregate further on it. You could use the transform to roll the data up into e.g. 1-day buckets; to retrieve your 7-day value you then run another aggregation on the transform destination. This query will be lightning fast, because it only has to aggregate over a tiny number of documents. I would actually make this more fine-grained and e.g. bucket in 10-minute intervals. The trade-off is speed vs. size; it basically depends on your amount of data, so you have to try it yourself and find the right balance.
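A sketch of that second aggregation against the (assumed) destination index from above. Note that to get a correct 7-day average you should weight by the counts rather than average the daily averages, which is why the transform sketch also stores event.count:

POST device-events-daily/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "device": "device-123" } },
        { "range": { "day": { "gte": "now-7d/d" } } }
      ]
    }
  },
  "aggs": {
    "events": {
      "terms": { "field": "event" },
      "aggs": {
        "total_time":  { "sum": { "field": "time.sum" } },
        "total_count": { "sum": { "field": "event.count" } },
        "avg_time": {
          "bucket_script": {
            "buckets_path": { "t": "total_time", "c": "total_count" },
            "script": "params.t / params.c"
          }
        }
      }
    }
  }
}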

If you think about the suggested approach, you might also want to look into rollup as an alternative to transform.
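A rollup job for the same data could look roughly like this (again with assumed index and field names); it has its own delay setting on the date field, so the same ingest-delay considerations apply:

PUT _rollup/job/device_events_rollup
{
  "index_pattern": "device-events*",
  "rollup_index": "device-events-rollup",
  "cron": "0 0 1 * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "delay": "1d" },
    "terms": { "fields": [ "device", "event" ] }
  },
  "metrics": [
    { "field": "time_consumed", "metrics": [ "avg", "sum", "value_count" ] }
  ]
}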
