We are working on a data processing pipeline that involves multiple transformations. Specifically, we have a use case where the first transformation runs and calculates documents for various systems, including system1. In this transformation, we categorize each document by where it is present, as "Primary only", "Secondary only", or "Both".
In our second transformation, we need to calculate or process documents again for system1. I’m concerned about how a change from "Primary only" to "Both" in the first transformation will be handled.
For example, suppose 10 logs currently have documents with the status "Primary only". If the first transformation then changes the status of 2 of those logs to "Both", how should the documents already indexed for those 2 logs be handled?
Since the documents with the "Primary only" status are already indexed, what’s the best approach to ensure these documents are properly updated or removed when their status changes to "Both"? We want to ensure that only "Primary only" documents are retained in the second transformation output.
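To make the desired behavior concrete, here is a minimal sketch in Python of the reconciliation we have in mind. It is only an illustration: the function name `reconcile`, the `log-N` IDs, and the in-memory set and dictionary are hypothetical stand-ins for the real indices, not part of any actual API.

```python
def reconcile(indexed_ids, current_status):
    """Split already-indexed IDs into those to keep and those to delete.

    indexed_ids:    IDs currently present in the second transformation's output
    current_status: mapping of ID -> latest status from the first transformation
    (Both arguments are hypothetical stand-ins for the real indices.)
    """
    # Keep only IDs whose latest status is still "Primary only".
    keep = {i for i in indexed_ids if current_status.get(i) == "Primary only"}
    # Everything else (changed to "Both", or no longer reported) should be removed.
    delete = set(indexed_ids) - keep
    return keep, delete


# 10 logs were indexed as "Primary only"; 2 have since changed to "Both".
indexed = {f"log-{n}" for n in range(1, 11)}
status = {log_id: "Primary only" for log_id in indexed}
status["log-3"] = "Both"
status["log-7"] = "Both"

keep, delete = reconcile(indexed, status)
print(len(keep), sorted(delete))  # 8 ['log-3', 'log-7']
```

In other words, after the second transformation runs we would expect 8 documents to remain and the 2 that moved to "Both" to be gone. The question is how to get this result when the earlier documents are already indexed.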