Can a transform recalculate if old documents update their values?

Hi all!

I have an index that contains millions of documents that have this structure:
{
  "id": "abc87452",
  "timestamp": "2021-05-13T12:45:13Z",
  "client": "company_X",
  "value": 10000,
  "transaction_status": "authorized"
}

So, I have a requirement to build a transform job in Elasticsearch that sums the value field for each client. But some of these documents/transactions have the status field updated a few days later, like:

{
  "id": "abc87452",
  "timestamp": "2021-05-13T12:45:13Z",
  "client": "company_X",
  "value": 10000,
  "transaction_status": "cancelled"
}

Can an Elasticsearch transform handle this? Suppose it ran on day 13, summed the values, and produced $50,000. Then on day 14 the transaction was cancelled. Does the transform update the sum to $40,000?

I hope that was clear; I can provide more details if needed.

Thanks in advance!

Hi @marcosvrrs !

A transform will not detect changes to existing documents, but it will recalculate the aggregation each time new documents come in for your group by field.

Using your example, once a transaction for company_X is cancelled, the transform will recalculate sum(value) the next time it sees a completed transaction for company_X.

Hope that helps!

Hi @blaklaybul , thanks for the reply.

I am not sure I fully understood the idea.

1 - If a thousand documents are overwritten, the transform won't update, right?
2 - "the transform will recalculate sum(value) the next time it sees a completed transaction for company_X " , I don't understand what you mean by completed transaction. A new document arriving at the index?

The group by criteria that I'm using are:
group by day, then group by client, then group by transaction_status (a simplified sketch of the pivot is below). As I said, a few days later some of the transactions will change the bucket of the group by, because transaction_status changed. Is the only alternative to delete the index and recalculate the transform?
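
Roughly, the pivot looks like this (simplified; the index name and bucket names are just placeholders):

POST _transform/_preview
{
  "source": { "index": "transactions" },   # <- example index name
  "pivot": {
    "group_by": {
      "day": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } },
      "client": { "terms": { "field": "client" } },
      "transaction_status": { "terms": { "field": "transaction_status" } }
    },
    "aggregations": {
      "total_value": { "sum": { "field": "value" } }
    }
  }
}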

In order for a transform to update the destination index as new data arrives, it must run in continuous mode. Otherwise it runs as a batch transform, which is a one-off operation.

To use continuous mode you need to specify sync in your config; this can be done in the UI as well as the API:

{
    "source": {...},
    "dest": {...},
    "pivot": {...},
    "sync": {
        "time": {
             "field": "a_timestamp_field",
             "delay": "60s"   # <- optional, default '60s'
        }
    }
}

The configured timestamp field must meet certain requirements: it must be modeled after the real clock and must not be in the past. Using delay you can adjust that: with a delay of 60s, the transform will not read data until 60s have passed. That means data can arrive up to 60s late and/or out of order.

I am not sure your timestamp field meets those requirements. Your example contains the same timestamp for both documents. Does that mean timestamp is the order date?

1. timestamp is the order date

If so, timestamp is not suitable for continuous operation. You could set delay to the maximum time within which an order can be cancelled, but that way the transform would wait days until it does anything.

The solution for this is to add an ingest timestamp. This timestamp can be used for sync, but for group_by you can still use timestamp.
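
A minimal sketch of that, with an ingest pipeline and field name of my own choosing (add_ingest_time, event.ingested and the transactions index are just examples):

PUT _ingest/pipeline/add_ingest_time
{
  "description": "store the time the document was indexed",
  "processors": [
    {
      "set": {
        "field": "event.ingested",            # <- example field name
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

PUT transactions/_settings
{
  "index.default_pipeline": "add_ingest_time"
}

Then point sync.time.field at event.ingested while keeping timestamp in your group_by.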

2. timestamp is already the ingest timestamp

If timestamp is already an ingest timestamp, or at least the timestamp of the operation set by your application, you only need to ensure delay is configured properly. Make sure documents arrive in Elasticsearch within the budget you set in delay, e.g. 60s.

However, now you have a problem with your group_by. You bin the timestamp into daily buckets, but if timestamp isn't the order date, the values can be spread over several buckets. What you need is to group by min(timestamp), and that is not possible with a single transform. For this case you have to group by the transaction id (and maybe your other group_by fields) and add an aggregation with order_date = min(timestamp). On that output you can either run another transform or create your visualization directly, which probably works fine as your first transform has already reduced the amount of data significantly.
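
A rough sketch of that first transform, with placeholder names only (transactions, transactions_by_id, and the event.ingested field from the ingest pipeline example above):

PUT _transform/transactions_by_id
{
  "source": { "index": "transactions" },
  "dest": { "index": "transactions_by_id" },
  "sync": {
    "time": { "field": "event.ingested", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "id": { "terms": { "field": "id" } },
      "client": { "terms": { "field": "client" } },
      "transaction_status": { "terms": { "field": "transaction_status" } }
    },
    "aggregations": {
      "order_date": { "min": { "field": "timestamp" } },
      "value": { "max": { "field": "value" } }
    }
  }
}

Since there is only one document per transaction id, max(value) just carries the value through; the second transform (or the visualization) can then sum it per day/client/status using order_date for the daily buckets.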

I hope that helps. If not, please share some more information: the configuration you are using and a description of the fields, e.g. the nature of the timestamp field.

P.S.

I guess this description contains a mistake: you are also grouping by id, right?

