Aggregating entities from transaction documents using transforms

Hello,

We have an application that models transactional data. Each document describes a single transaction between two entities.

Our current import process runs in two phases. First we ingest all the transactions into a transactions index. Afterwards, we run a suite of aggregations against that index to build an entities index, which contains aggregated fields from the transaction documents along with some higher-level statistics (e.g. total transactions, total transaction sum, etc.). However, over the years this system has grown unwieldy and slow.

A colleague recommended using Elasticsearch transforms for this, and at first glance it seemed like a perfect fit. However, we have run into implementation issues because each transaction aggregates into two entities.

For example, these two transactions:

{
    ids: [
        a, b
    ],
    src_id: a,
    dst_id: b,
    src_field: 5,
    dst_field: 6
},
{
    ids: [
        b, c
    ],
    src_id: b,
    dst_id: c,
    src_field: 3,
    dst_field: 12
}

Which can describe the following relationships:

entityA     entityB    entityC
   ^        ^    ^       ^
    \      /      \     /
     \    /        \   /
transactionAB    transactionBC

Would need to transform into three entity docs:

{
    id: a,
    count: 1,
    field_sum: 5
},
{
    id: b,
    count: 2,
    field_sum: 9
},
{
    id: c,
    count: 1,
    field_sum: 12
}

I've attempted running two separate transforms, one that groups on the src_id side of the transaction and another on the dst_id side, aggregating the respective fields, but they overwrite each other's output in the destination index.

I've attempted running it as a single transform that groups on the ids field, but then the src / dst fields get mixed together.

I think this could be solved using three transforms and an intermediary index, but at that point the complexity isn't worth it for our use case.

Is there any way to achieve the desired result without the use of intermediary indices? I know Elasticsearch has a lot of functionality with regards to scripted aggregations and script processors, but I'm not sure whether they can solve this issue.

This seems like quite a complex task. Do I get this right: you basically want to treat src and dst the same way? The only required logic seems to be picking src_field or dst_field depending on the side.

So option 1 seems to be to flatten the data before the transform. If you are able, either in your application or using some sort of reindex, to index 4 documents instead of 2:

{
    id: a,
    field: 5
},
{
    id: b,
    field: 6
},
{
    id: b,
    field: 3
},
{
    id: c,
    field: 12
}

After that, a single transform could do the job.
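For illustration, a single pivot transform over the flattened documents might look something like this (the index, transform, and field names here are assumptions based on the example above):

```json
PUT _transform/entities
{
  "source": { "index": "transactions_flat" },
  "dest": { "index": "entities" },
  "pivot": {
    "group_by": {
      "id": { "terms": { "field": "id" } }
    },
    "aggregations": {
      "count": { "value_count": { "field": "id" } },
      "field_sum": { "sum": { "field": "field" } }
    }
  }
}
```

This would produce the three entity documents from your example (a: count 1 / sum 5, b: count 2 / sum 9, c: count 1 / sum 12).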

It is not a good idea to let 2 transforms write into 1 index. You can easily let them write into separate indices instead and query them together using an index pattern, e.g. transaction_by_src and transaction_by_dst for the transforms and transaction_by_* for querying the data.
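As a sketch, the src-side transform could look like this; the dst-side one is symmetric, swapping src_* for dst_* and the destination index name (all names here are made up):

```json
PUT _transform/transaction_by_src
{
  "source": { "index": "transactions" },
  "dest": { "index": "transaction_by_src" },
  "pivot": {
    "group_by": {
      "id": { "terms": { "field": "src_id" } }
    },
    "aggregations": {
      "count": { "value_count": { "field": "src_id" } },
      "field_sum": { "sum": { "field": "src_field" } }
    }
  }
}
```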

Do you need the 3rd transform? If I get it right, the 3rd transform just combines the results of the other 2, and that could be done query side. The 3rd transform would only combine either 1 or 2 documents per bucket, whereas your other 2 transforms probably reduce the data by a much larger factor. How many transactions and how many entities do you have?
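The query-side combination could be a terms aggregation across the index pattern, summing the pre-aggregated fields, roughly like this (index and field names assumed from the sketch above):

```json
GET transaction_by_*/_search
{
  "size": 0,
  "aggs": {
    "entities": {
      "terms": { "field": "id" },
      "aggs": {
        "total_count": { "sum": { "field": "count" } },
        "total_field_sum": { "sum": { "field": "field_sum" } }
      }
    }
  }
}
```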

As said, the only way to avoid more than 1 transform is to flatten the data before the transform. Your core problem seems to be: every document is really 2 documents, so you need a 1:2 mapping operation, but everything you can do, e.g. in a scripted field, is a 1:1 mapping. Fixing this at ingest can solve the problem.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.