Join two indices (many-to-many relationships)

bianca_s · July 26, 2024, 2:17pm

Hi all,

We have the following use case: we want to create an index that holds producer - topic - consumer relationships.

Basically all relationships here are many-to-many:

Multiple producers can write to the same topic
One producer can write to multiple topics
Multiple consumers can read from the same topic
One consumer can read from multiple topics

The producer-topic relationship is available as a CSV file, like this:
producer;topic
producer-a;topic-1
producer-a;topic-2
producer-b;topic-1
producer-b;topic-3
producer-c;topic-4
...
We could directly create an index with producer-topic relationships from it.
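
For illustration, a minimal Python sketch of turning that file into one document per row. It assumes the semicolon delimiter and the `producer;topic` header shown above. The function name `read_producer_topic_docs` is made up for this example, and actually indexing the documents is left out.

```python
import csv
import io


def read_producer_topic_docs(csv_text):
    """Turn 'producer;topic' CSV text into one dict per row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return [{"producer": row["producer"], "topic": row["topic"]} for row in reader]


# Sample rows copied from the excerpt above.
sample_csv = """producer;topic
producer-a;topic-1
producer-a;topic-2
producer-b;topic-1
"""

docs = read_producer_topic_docs(sample_csv)
for doc in docs:
    print(doc)
```

Each dict could then be sent to Elasticsearch as its own document, for example through the bulk API.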

The topic - consumer relationships are only available implicitly as we're constantly ingesting consumer rates per consumer and topic. So by grouping this index by consumer and topic we would get all pairs of consumers and topics.
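
One way this grouping could be expressed is a composite aggregation, which pages through every distinct combination of the grouped fields. The sketch below only builds the search request body and reads a response. The field names `topic` and `consumer` are assumptions about the rate index (they would need to be keyword fields), and the response at the bottom is hand-written sample data, not real output.

```python
def build_pairs_request(topic_field="topic", consumer_field="consumer",
                        page_size=1000, after_key=None):
    """Build a search body that lists distinct (topic, consumer) combinations."""
    composite = {
        "size": page_size,
        "sources": [
            {"topic": {"terms": {"field": topic_field}}},
            {"consumer": {"terms": {"field": consumer_field}}},
        ],
    }
    if after_key is not None:
        # Continue from where the previous page stopped.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"pairs": {"composite": composite}}}


def extract_pairs(response):
    """Return the (topic, consumer) tuples of one page plus the key for the next page."""
    agg = response["aggregations"]["pairs"]
    pairs = [(b["key"]["topic"], b["key"]["consumer"]) for b in agg["buckets"]]
    return pairs, agg.get("after_key")


# Hand-written stand-in for a search response (consumer names are hypothetical).
fake_response = {
    "aggregations": {
        "pairs": {
            "after_key": {"topic": "topic-1", "consumer": "consumer-y"},
            "buckets": [
                {"key": {"topic": "topic-1", "consumer": "consumer-x"}, "doc_count": 12},
                {"key": {"topic": "topic-1", "consumer": "consumer-y"}, "doc_count": 7},
            ],
        }
    }
}

pairs, next_key = extract_pairs(fake_response)
print(pairs, next_key)
```

The caller would repeat the request with `after_key=next_key` until a page comes back with no buckets.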

We now have to find a way to combine our producer-topic pairs with the (implicit) information about consumer-topic pairs. I had a few ideas but am not sure if any of them will actually work.

The desired output would be one document per producer - topic - consumer.
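
To make the target shape concrete, here is a small in-memory sketch of the join on topic, independent of any Elasticsearch or Logstash feature. The consumer names are hypothetical, and `join_on_topic` is just an illustrative helper.

```python
from collections import defaultdict


def join_on_topic(producer_topic, topic_consumer):
    """Join (producer, topic) pairs with (topic, consumer) pairs on the topic."""
    consumers_by_topic = defaultdict(list)
    for topic, consumer in topic_consumer:
        consumers_by_topic[topic].append(consumer)
    # One output document per producer - topic - consumer combination.
    return [
        {"producer": producer, "topic": topic, "consumer": consumer}
        for producer, topic in producer_topic
        for consumer in consumers_by_topic.get(topic, [])
    ]


producer_topic = [
    ("producer-a", "topic-1"),
    ("producer-b", "topic-1"),
    ("producer-c", "topic-4"),
]
topic_consumer = [
    ("topic-1", "consumer-x"),  # hypothetical consumer names
    ("topic-1", "consumer-y"),
]

triples = join_on_topic(producer_topic, topic_consumer)
for triple in triples:
    print(triple)
```

This behaves like an inner join: a topic without any consumer (`topic-4` here) produces no documents. If such producers should still appear, the helper would need to emit a document with an empty consumer instead.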

Note: We won't have to do this regularly as the relationships between producers - topics - consumers will only change every other month. So it can for example involve some degree of manual work.

Idea 1
I was thinking about using a transform to create an index with topic-consumer pairs. This should work.
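
A rough sketch of what the body of such a pivot transform might look like, written as a Python dict. The index names `consumer-rates` and `topic-consumer-pairs` and the field names `topic`, `consumer` and `rate` are placeholders, not taken from the actual setup. As far as I know a pivot needs at least one aggregation, so an average rate is included only to satisfy that.

```python
import json


def build_transform_body(source_index, dest_index,
                         topic_field="topic", consumer_field="consumer",
                         rate_field="rate"):
    """Build a pivot transform body that groups by topic and consumer."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            "group_by": {
                "topic": {"terms": {"field": topic_field}},
                "consumer": {"terms": {"field": consumer_field}},
            },
            # Placeholder aggregation; only the group_by keys matter for the pairs.
            "aggregations": {
                "avg_rate": {"avg": {"field": rate_field}},
            },
        },
    }


# Hypothetical index names.
transform_body = build_transform_body("consumer-rates", "topic-consumer-pairs")
print(json.dumps(transform_body, indent=2))
```

The printed JSON is what would be sent when creating the transform. The destination index would then hold one document per topic-consumer pair.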

What I am not sure about is how we can then combine the producer-topic index with the consumer-topic index. I was thinking about enrich pipelines (but I think this won't work for a many-to-many relationship) and Logstash (but I don't have much experience with Logstash, so I'm not sure whether it is capable of this).

Idea 2
Use Logstash only: ingest the CSV file with producer-topic pairs via Logstash. In the Logstash pipeline, query the Elasticsearch index with consumer-topic rates and run a terms aggregation on it to get the consumer-topic pairs. Then somehow enrich the producer-topic pairs with the consumers.

I know that Logstash can be used to enrich ingested data with data from an existing index. But my doubts here are:

We won't simply have to add a new field to the incoming documents, we basically also have to split them whenever the current topic is being consumed by multiple consumers, as we want one document per producer-topic-consumer triple.
We don't want to enrich with the results of a query but with the results of an aggregation. Not sure if that is possible.

Very much appreciate any input on my thoughts or alternative approaches. Thank you!

ashishtiwari1993 · July 30, 2024, 6:53am

Hi @bianca_s, to combine the data, you can try the enrich processor, which lets you look up specific information in another index and merge it into your current data.
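
A hedged sketch of how that could look for this case, written as Python dicts plus one helper. The policy name `topic-consumers`, the index name `topic-consumer-pairs` and the target field `consumers` are made up for illustration. With `max_matches` above 1 the enrich processor stores the matches as an array on the target field, and as far as I know 128 is the upper limit. An ingest pipeline cannot turn one document into several, so the split into one document per consumer is done here in client code.

```python
def build_enrich_policy(source_index="topic-consumer-pairs"):
    """Match policy that looks up consumers by topic (index name is hypothetical)."""
    return {
        "match": {
            "indices": source_index,
            "match_field": "topic",
            "enrich_fields": ["consumer"],
        }
    }


def build_enrich_processor(policy_name="topic-consumers"):
    """Enrich processor that attaches all matching consumers as an array."""
    return {
        "enrich": {
            "policy_name": policy_name,
            "field": "topic",
            "target_field": "consumers",
            "max_matches": 128,
        }
    }


def split_by_consumer(enriched_doc):
    """Fan an enriched producer-topic document out into one document per consumer."""
    return [
        {
            "producer": enriched_doc["producer"],
            "topic": enriched_doc["topic"],
            "consumer": match["consumer"],
        }
        for match in enriched_doc.get("consumers", [])
    ]


# Hand-written example of an enriched document (consumer names are hypothetical).
enriched = {
    "producer": "producer-a",
    "topic": "topic-1",
    "consumers": [
        {"topic": "topic-1", "consumer": "consumer-x"},
        {"topic": "topic-1", "consumer": "consumer-y"},
    ],
}

split_docs = split_by_consumer(enriched)
for doc in split_docs:
    print(doc)
```

Topics with more consumers than `max_matches` would lose some of them, so this route only fits if the number of consumers per topic stays small.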