Transform + Enrichment Policy

Hello,

I had the idea of using the target index of a sum aggregation transform as the target index for an enrichment policy as well. Essentially, I want to perform a join on the field being grouped on in the transform and enrich those documents with a "name" field. The name field exists in a different document type than the aggregated field, but the two types share a common field: the "Group By" field. As documents are continuously updated/written to the transformed index, they should be fed through the enrich pipeline and have the name field attached seamlessly.

  1. How would I set the target index on the pipeline so documents are processed through the pipeline upon being indexed to the transformed index?

  2. Is it at all possible to set up a continuous enrichment policy?

  3. How much is too much load for a transform or an enrichment (i.e., how many documents)?

Thanks

When you create a transform you can choose an ingest pipeline to run, so this ingest pipeline is where you would put your enrich processor.
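A minimal sketch of how that could look, assuming a hypothetical enrich policy name-policy, a pipeline add-name, a shared key field group_field, and a @timestamp field for continuous mode (all names are examples, not from this thread):

# ingest pipeline holding the enrich processor
# (the enrich policy must already exist and have been executed)
PUT _ingest/pipeline/add-name
{
  "processors": [
    {
      "enrich": {
        "policy_name": "name-policy",
        "field": "group_field",
        "target_field": "name_info"
      }
    }
  ]
}

# the transform references the pipeline in its dest, so every document
# it writes to the destination index runs through the enrich processor
PUT _transform/sum-transform
{
  "source": { "index": "source-index" },
  "dest": {
    "index": "transformed-index",
    "pipeline": "add-name"
  },
  "pivot": {
    "group_by": {
      "group_field": { "terms": { "field": "group_field" } }
    },
    "aggregations": {
      "total": { "sum": { "field": "some_numeric_field" } }
    }
  },
  "sync": { "time": { "field": "@timestamp" } }
}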

What do you mean by continuous enrichment policy? Enrich policies and the enrich processor work best when your source data is static. If you need to constantly update the source index of your enrich policy, you may run into issues, as you would need to execute the policy every time the source index is updated. It is not possible to automatically execute an enrich policy to update the enrich index; there is an open issue about it, but no changes so far.
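Executing the policy is a single call; each execution rebuilds the internal enrich index from the current content of the policy's source index (policy name is an example):

POST _enrich/policy/name-policy/_execute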

It is not possible to know without testing; you will need to test your transform and see whether it impacts your cluster.

You can do a "join" directly without enrich. A transform source can take an array of indices or patterns. All you need is a "primary key", i.e. a field that exists in both indices under the same name (if that's not the case, you can help yourself with aliases or runtime fields).

In the aggregation part you can copy over the fields you want from your "enrichment index" using either a top_metrics aggregation or scripted_metric.
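A rough sketch of such a transform-only join, with all index and field names invented for illustration: the source takes both indices, the group_by uses the shared key, and top_metrics copies meta over. Note that top_metrics requires a sort; sorting on the copied field itself works because documents missing the sort field go last by default, so the value is taken from a document that actually has it.

PUT _transform/join-transform
{
  "source": { "index": ["index-a", "index-b"] },
  "dest": { "index": "joined-index" },
  "pivot": {
    "group_by": {
      "key": { "terms": { "field": "key" } }
    },
    "aggregations": {
      "meta_top": {
        "top_metrics": {
          "metrics": [ { "field": "meta" } ],
          "sort": { "meta": "desc" }
        }
      }
    }
  }
}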

The benefit of the "transform only" approach: An update on either index triggers the update.

Example: Is is possible to partially update a dest doc with transforms? - #4 by Hendrik_Muhs

Thanks @Hendrik_Muhs and @leandrojmp for your responses. The two document types I'm trying to join are actually part of the same index pattern; they just belong to two different datasets. They do, however, have a similar field that would act as the "primary key". Does the transform source need to take an array of two or more indices?

If you can configure both source indices with one pattern, that's fine. I only mentioned the array for cases where the two indices to join are named differently.

Sorry, I think my saying that the two datasets belong to the same index pattern made it sound like they were from two different source indices. They belong to the same index; they just come from two different inbound sources, so they have different log types.

Can you share a data sample to illustrate? My understanding is that, say, we have doc A:

"key": "A",
"log_type": "x"
"meta": "some_variable"
... # other fields and values

and doc B (with no meta field):

"key": "A",
"log_type": "y"
...# other fields and values

Desired output A+B, grouping on key, but not on log_type:

"key": "A",
"meta": "some_variable"
...# other aggregated fields and values

It doesn't matter that doc B doesn't contain meta; top_metrics ignores empty values.

Does that illustrate your case? If not, please provide an example.

My case looks more like this:

Doc A:

"key": "A",
"log_type": "x"
"int_field": int_val
... # other fields and values

Doc B:

"key": "A",
"log_type": "y"
"name_field": "name_val"
... # other fields and values

Desired output A+B, grouping on key:

"key": "A",
"int_field": int_val
"name_field": "name_val"

Very similar to the example you included; however, the combined document needs to join two crucial fields on the key that relates them.

For both int_field and name_field I suggest looking at top_metrics. top_metrics supports more than one field, so you can implement this with either one or two top_metrics aggregations.
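For illustration, the pivot could look like this (key, int_field and name_field are from your example, everything else is assumed). One top_metrics per field, each sorted on the field it copies, so each value comes from a document that actually contains it:

"pivot": {
  "group_by": {
    "key": { "terms": { "field": "key" } }
  },
  "aggregations": {
    "int_top": {
      "top_metrics": {
        "metrics": [ { "field": "int_field" } ],
        "sort": { "int_field": "desc" }
      }
    },
    "name_top": {
      "top_metrics": {
        "metrics": [ { "field": "name_field" } ],
        "sort": { "name_field": "desc" }
      }
    }
  }
}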

Thanks, @Hendrik_Muhs. I have been able to join those two fields on their key in a transform. I also have a separate question: is it possible, in a search query, to filter to only the cases where a name_field-type document has a corresponding key in an int_field-type document, and retrieve those name_field and int_field docs (i.e., only documents that can actually be joined)? There are currently many documents in the transform where name_field is null because the int_field doc has no associated name_field doc.

I am not sure I fully understand the question. I get that you are ending up with docs like this:

"key": "A",
"int_field": int_val
"name_field": null

Using a search query you can filter out docs using exists; exists only matches docs which have a value for a field. However, as a query applies to both your "left" and "right" index in a join, you would drop all docs. The problem seems to be that if name_field is null, you are probably missing a document from one of the indices to join.

But if I understand the problem correctly, you only know after the join whether that's the case. I suggest solving this with an ingest pipeline that you can attach to the transform dest and dropping documents that have name_field == null. You find this case in the docs here.
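A minimal sketch of such a pipeline (name invented), using the drop processor with an if condition; you would attach it via the pipeline setting of the transform's dest:

PUT _ingest/pipeline/drop-unjoined
{
  "processors": [
    {
      "drop": {
        "if": "ctx.name_field == null"
      }
    }
  ]
}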

The only other option I see: you keep indexing the docs as-is, but use exists as a filter when you query your transform destination index in your dashboard/application.
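For example (destination index name assumed):

GET joined-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "name_field" } }
      ]
    }
  }
}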

Hi @Hendrik_Muhs, thanks for your advice on transform "joins". I have a question building on top of emi_rose's. If you have a doc A as:

"key": "A",
"meta": "some_variable"
... # other fields and values

and then a doc B with:

"key": "A",
"my_id": "123"

but the desired destination document A+B is:

"my_id": "123"
"meta": "some_variable"
... # other fields and values

As you can see, I specifically don't want to aggregate by key, because in my use case several documents with the same my_id could have different key values, yet I want all of them in the same bucket.

Are transforms appropriate for that use case? I feel like I'd have to chain two transforms to do another "group by" afterward :thinking:.

To make things even more complex, I have a timestamp in my doc A that I want to use for a date histogram aggregation, and it is not clear to me how to solve that "join". It all sounds very complex for what is basically "just" a lookup in a key-value index that an enrich processor could solve (except that new keys are frequently added), but perhaps I'm not looking at it from the right perspective.

Can you open a new thread for your use case? Feel free to ping me in the description, so I won't overlook it.

Can you provide a full example? I am confused:

Do you want to aggregate by my_id? You can aggregate on any field; the name of the field does not matter.

So the timestamp is not available in all docs? If that's the case, you have to join first.

Enrich works best if the data to look up is static. It can't update enriched data after the enrich step has been executed.

Whether it is better to use enrich or transform depends on the use case. I can recommend one or the other once I understand the use case. As said, please open a new thread.
