I'm looking for an idea or approach to solve the following business problem:
Two streaming data sources (A and B) continuously ingest events into two separate Elasticsearch indices (A and B). Each document has a unique _id.
My goal is to merge Index A with Index B based on the documents' unique _id field and split the result into three buckets:
Bucket 1: All documents from Index A without a matching _id from Index B
Bucket 2: All documents from Index B without a matching _id from Index A
Bucket 3: All documents that have been matched; each matched pair should be stored as a single merged document.
Keep in mind that the data may be incomplete at any given time: a document without a match now may find its counterpart later, since ingestion is continuous.
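To make the matching part concrete, here is a rough, untested sketch of what I imagine for the pipeline that reads stream A, using the Logstash elasticsearch filter plugin to look up the same id in index B (names such as doc_id, payload, index-b, bucket-matched and bucket-a-only are placeholders I made up):

```
filter {
  # Look up a document with the same id in index B; if one exists,
  # copy its "payload" field into this event as "b_payload".
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "index-b"
    query  => "_id:%{doc_id}"                  # doc_id: placeholder field holding the shared id
    fields => { "payload" => "b_payload" }     # "payload" is a placeholder source field
  }
}

output {
  if [b_payload] {
    # Bucket 3: a match was found -> store the merged document once
    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "bucket-matched"
      document_id => "%{doc_id}"
    }
  } else {
    # Bucket 1: no match in B (yet) -> keep it in the A-only bucket for now
    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "bucket-a-only"
      document_id => "%{doc_id}"
    }
  }
}
```

A mirrored pipeline for stream B would produce bucket 2, but I'm unsure how this handles the "match arrives later" case: would I have to re-query and move documents between buckets myself?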
Can I implement this scenario with Logstash (and if so, how?), or do I need additional components such as Kafka and/or a message queue? What is the most convenient approach? Any experiences?
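One alternative I was considering (again only an untested sketch; merged-index and doc_id are placeholder names) is to have both pipelines write to a single shared index keyed by the common id, so that Elasticsearch merges matching documents via an upsert in the output stage:

```
output {
  # Used in both pipeline A and pipeline B: documents with the same id
  # end up combined into one document in the shared index.
  elasticsearch {
    hosts         => ["localhost:9200"]
    index         => "merged-index"
    document_id   => "%{doc_id}"
    action        => "update"
    doc_as_upsert => true
  }
}
```

With this approach, though, buckets 1 and 2 would have to be derived afterwards by querying for documents that only contain fields from one of the two sources, so I'm not sure it really satisfies the three-bucket requirement.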
Thanks for your help.