I had the idea of using the target index of a sum-aggregation transform as the target index for an enrich policy. Essentially, I want to perform a join on the field being grouped on in the transform and enrich those documents with a "name" field, because the name field exists in a different document type than the aggregated field. The two types do share a common field, however: the "group by" field. As documents are continuously updated and written to the transformed index, they should be fed through the enrich pipeline and have the name field attached to them seamlessly.
How would I set the target index on the pipeline so documents are processed through the pipeline upon being indexed to the transformed index?
Is it at all possible to set up a continuous enrichment policy?
How much is too much load for a transform or an enrichment (i.e., how many documents)?
When you create a transform you can choose an ingest pipeline to run, so in that ingest pipeline you would have your enrich processor.
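A minimal sketch of what that looks like. All index, pipeline, and field names here (`source-index`, `transformed-index`, `my-enrich-pipeline`, `group_by_field`, `metric_field`) are placeholders, not names from this thread:

```
PUT _transform/my-sum-transform
{
  "source": { "index": "source-index" },
  "dest": {
    "index": "transformed-index",
    "pipeline": "my-enrich-pipeline"
  },
  "pivot": {
    "group_by": {
      "group_key": { "terms": { "field": "group_by_field" } }
    },
    "aggregations": {
      "total": { "sum": { "field": "metric_field" } }
    }
  },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  }
}
```

Because the pipeline is set on `dest.pipeline`, every document the transform writes to the destination index passes through it, so the enrich processor inside the pipeline runs on each indexed document without any extra wiring.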
What do you mean by a continuous enrichment policy? Enrich policies and the enrich processor work best when your source data is static. If you need to constantly update the source index of your enrich policy, then you may have some issues, as you would need to execute the policy every time the source index is updated. It is not possible to automatically execute an enrich policy to update the enrich index; there is an open issue about it, but no changes so far.
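For reference, this is the manual re-execution being described. The policy and index names (`names-policy`, `names-index`, `group_key`) are placeholders; the point is that the `_execute` call has to be repeated yourself (e.g. on a schedule) whenever the source index changes:

```
PUT /_enrich/policy/names-policy
{
  "match": {
    "indices": "names-index",
    "match_field": "group_key",
    "enrich_fields": ["name"]
  }
}

POST /_enrich/policy/names-policy/_execute
```

Each `_execute` rebuilds the internal enrich index from the current contents of the source index; documents ingested before the re-execution keep whatever values were enriched at the time.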
It is not possible to know without testing; you will need to test your transform and see whether it impacts your cluster.
You can do a "join" directly without enrich. A transform source can take an array of indices or patterns. All you need is a "primary key", i.e. a field that exists in both indices under the same name (if that's not the case, you can help yourself with aliases or runtime fields).
In the aggregation part you can copy over the fields you want from your "enrichment index" using either a top_metrics aggregation or scripted_metric.
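A sketch of the "transform only" join, assuming hypothetical indices `metrics-index` and `names-index` sharing a key field `my_key`, with `top_metrics` copying the name over:

```
PUT _transform/joined-transform
{
  "source": { "index": ["metrics-index", "names-index"] },
  "dest": { "index": "joined-index" },
  "pivot": {
    "group_by": {
      "my_key": { "terms": { "field": "my_key" } }
    },
    "aggregations": {
      "total": { "sum": { "field": "metric_field" } },
      "name": {
        "top_metrics": {
          "metrics": { "field": "name_field" },
          "sort": { "@timestamp": "desc" }
        }
      }
    }
  }
}
```

Each bucket then contains both the summed metric and the latest `name_field` value seen for that key, regardless of which source index each document came from.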
The benefit of the "transform only" approach: An update on either index triggers the update.
Thanks @Hendrik_Muhs and @leandrojmp for your responses. The two document types I'm trying to join are actually part of the same index pattern, they just belong to two different datasets, although they have a similar field that would act as the "primary key". Does transform source need to take an array of two or more indices?
Sorry, I think by me saying the two datasets belong to the same index pattern it made it sound like they were from two different source indices. They belong to the same index, they are just from two different inbound sources so they belong to different log types.
Thanks, @Hendrik_Muhs. I have been able to join those two fields on their key in a transform. I also have a separate question: is it possible, in a search query, to filter to only the name_field-type documents that have a corresponding key in an int_field-type document, and retrieve both sets of docs (i.e., only documents that can potentially be joined)? There are currently many documents in the transform where the name field is null because the int_field is not associated with a name_field.
Using a search query you can filter out docs with exists; exists only matches docs that have a value for a field. However, because a query applies to both your "left" and "right" index in a join, you would drop all docs. The problem seems to be that if name_field is null, you are probably missing a document from one of the indices to join.
But if I understand the problem correctly, you only know after the join whether that's the case. I suggest solving this with an ingest pipeline that you attach to the transform dest: I would drop documents that have name_field == null. You can find this case in the docs.
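A minimal version of that pipeline (the pipeline name is a placeholder; `name_field` is the field from this thread):

```
PUT _ingest/pipeline/drop-unjoined
{
  "processors": [
    {
      "drop": {
        "if": "ctx.name_field == null"
      }
    }
  ]
}
```

Attached via the transform's `dest.pipeline` setting, this silently discards any composite document where the join did not produce a name, so only fully joined documents land in the destination index.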
The only other option I see: You keep indexing the docs as is, but use exists as filter when you query your transform destination index in your dashboard/application.
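That query-time alternative would look like this, assuming the destination index is called `transformed-index`:

```
GET transformed-index/_search
{
  "query": {
    "exists": { "field": "name_field" }
  }
}
```

The trade-off: the unjoined documents stay in the index (and count toward its size), but nothing is lost if a matching name document arrives later and the transform updates the bucket.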
Hi @Hendrik_Muhs, thanks for your advice on transform "joins". I have a question building on top of emi_rose's. If you have a doc A such as:
... # other fields and values
and then a doc B with
but the desired destination document A+B is:
... # other fields and values
As you can see, I specifically don't want to aggregate by key, because in my use case several documents with the same my_id could have different key values, yet I want all of them in the same bucket.
Are transforms appropriate for that use case? I feel like I'd have to chain two transforms to apply another "group by" afterward.
To make things even more complex, I have a timestamp in doc A that I want to use for a date histogram aggregation, and it is not clear to me how to solve that "join" there. It all sounds very complex for what is basically "just" a lookup in a key-value index that an enrich processor could solve (except that new keys are frequently added), but perhaps I'm not looking at it from the right perspective.