I have read two other posts related to this but I still have not found a solution to my issue. I am ingesting large amounts of data at an unpredictable rate, so I have implemented ILM policies that roll an index over once it reaches a certain size. This avoids ending up with one very large index alongside a bunch of smaller ones, and it keeps the data load even across shards (no hot shards). The two issues I am having are:
First, deduplication across rollover. While ingesting a large dataset that contains duplicates, I ingest a document, extract entities, and write the result to ES. Ingestion continues, and after a certain amount of data the index rolls over. When the duplicate document appears later in the ingestion, ES does not recognize it as a duplicate, because the new write index knows nothing about the previous (now effectively read-only) index. Is there any way to solve this without first searching the alias and then writing? That is not a valid solution for me because it would incur too much overhead, and I believe it would rule out the bulk API.
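To make the problem concrete, here is a minimal sketch (my own illustration, not part of the actual pipeline) of the usual dedup approach with deterministic, content-derived `_id`s, and why rollover defeats it. The `doc_id` helper and the example documents are hypothetical:

```python
import hashlib
import json

def doc_id(doc: dict) -> str:
    """Deterministic _id from document content (a hypothetical dedup key)."""
    canonical = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# The same content always yields the same _id, so within a single index an
# index request with op_type=create is rejected as a version conflict:
a = doc_id({"entity": "ACME", "text": "some extracted text"})
b = doc_id({"text": "some extracted text", "entity": "ACME"})
assert a == b  # key order does not matter

# But after rollover, the alias points at a fresh write index that has no
# record of this _id, so op_type=create happily accepts the duplicate —
# exactly the situation described above.
```

The `create` uniqueness check is per index, not per alias, which is why it stops helping the moment the alias rolls over.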
Second, updating a document that no longer lives in the write index. Is there any solution besides searching the alias to find which backing index holds the document and then updating it there?
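For reference, the read-then-update path I am trying to avoid would look roughly like this (a sketch assuming the Python Elasticsearch client; `es`, the alias name, and the query are placeholders). The key point is that the search hit carries the concrete backing index, which is what the update must target:

```python
def update_target(hit: dict) -> tuple[str, str]:
    """Extract (backing_index, doc_id) from a search hit returned by
    querying the alias — the _index field names the concrete index."""
    return hit["_index"], hit["_id"]

# Example hit, shaped like an entry of response["hits"]["hits"]:
hit = {"_index": "myindex-000001", "_id": "abc123", "_source": {"entity": "ACME"}}
index, hit_id = update_target(hit)

# The update then goes to the concrete older index, not the alias
# (writes through the alias would land on the current write index):
# es.update(index=index, id=hit_id, doc={"entity": "ACME Corp"})
```

This works, but it turns every update into a search plus a write, which is the overhead I am trying to avoid.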
Thanks for any help/suggestions.