We need help or suggestions on how to handle this.
From what I understand from the documentation and previous forum posts, ES supports only the ID field as a unique-value field. Please correct me if I am wrong.
We are trying to index web pages and want to avoid indexing the same URL twice. This could be done by making the URL the ID of the document. However, because of constraints on how the ID is used in the rest of our application, we cannot use the URL as the ID. We need a UUID as the ID of the document.
We are considering the options below. Which one would be better? Or are they all bad ideas, and is there another way?
- Keep the URL as a separate field, and decide whether to insert or update by doing a lookup on the URL each time a new document is indexed.
- Maintain another index holding just the URL-to-UUID mapping, and look up the UUID for the incoming URL in it each time a new document comes in.
- Have a batch job that looks for duplicates (using an aggregation). In this case, however, we will have duplicate documents until the batch job kicks in.
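To make the first option concrete, here is a rough Python sketch of the lookup-then-decide step as we picture it. It assumes the URL is stored in a non-analyzed (keyword) field that we call `url` here; `search` is a hypothetical callable standing in for whatever client call runs a search body against the index and returns the usual response with `hits.hits`. The helper names are ours, not part of any ES API.

```python
import uuid


def build_url_lookup(url):
    """Search body matching documents whose `url` field equals `url` exactly.

    Assumes `url` is mapped as a keyword (non-analyzed) field.
    """
    return {"query": {"term": {"url": url}}, "size": 1}


def choose_doc_id(url, search):
    """Return (doc_id, is_new) for an incoming URL.

    `search` is a hypothetical callable: it takes a search body and
    returns a search response shaped like {"hits": {"hits": [...]}}.
    """
    hits = search(build_url_lookup(url))["hits"]["hits"]
    if hits:
        # URL already indexed: reuse its UUID so the write becomes an update.
        return hits[0]["_id"], False
    # URL not seen before: mint a fresh random UUID for a new document.
    return str(uuid.uuid4()), True
```

One thing we are unsure about with this approach: since search is near real-time, a document indexed moments ago may not be visible to the lookup yet, and two indexers handling the same URL at the same time could both miss and both create a new document.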
We will be indexing 2-3 million documents every 24 hours, and the index will hold around 100 million documents in total.
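For the batch-job option, the duplicate search we have in mind would be roughly a terms aggregation with `min_doc_count` set to 2, as in the sketch below. Again, the `url` field name (assumed to be a keyword field), the aggregation name `duplicate_urls`, and the `max_buckets` parameter are our own placeholders. We have not measured how a terms aggregation over this many distinct URLs would behave at our volume.

```python
def build_duplicate_url_query(max_buckets=1000):
    """Search body listing URL values that occur in two or more documents.

    Assumes `url` is a keyword field; `max_buckets` caps how many
    duplicated URLs one request returns.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "duplicate_urls": {
                "terms": {
                    "field": "url",
                    "min_doc_count": 2,
                    "size": max_buckets,
                }
            }
        },
    }


def duplicate_urls(response):
    """Extract (url, doc_count) pairs from the aggregation response."""
    buckets = response["aggregations"]["duplicate_urls"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]
```

The batch job would then have to pick one document per returned URL to keep and delete the rest.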