Handling a unique field (other than the ID)

(Matt) #1

Hi All

Need help/suggestions on how to handle this

From what I understand from reading the documentation and previous forum posts, ES supports only the ID field as a unique-value field. Please correct me if I am wrong.

We are trying to index web pages and want to avoid indexing the same URL twice. This could be done by making the URL the ID of the document; however, because of constraints on how the ID is used in the rest of our application, we cannot use the URL as the ID. We need a UUID as the ID of the document.

We are considering the options below. Which one would be better? Or are they all bad ideas, and is there another way?

  1. Keep the URL as a separate field, and decide whether to insert or update by doing a lookup on the URL each time a new document is indexed.
  2. Maintain another index with just the URL-to-UUID mapping, and look up the UUID for the incoming URL each time a new document comes in.
  3. Have a batch job that looks for duplicates (using an aggregation). In this case, though, we will have duplicate documents until the batch job kicks in.
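
To make option 3 concrete, here is a rough sketch of the duplicate check as a terms aggregation with `min_doc_count` set to 2. It only builds the request body and reads a response; nothing is sent to a cluster. The field name `url` (assumed to be mapped as `keyword`), the aggregation name `duplicate_urls`, and both helper functions are assumptions for illustration, not something we have in place.

```python
import json


def duplicate_url_query(field: str = "url", max_buckets: int = 1000) -> dict:
    """Build a search body that returns only URLs occurring in 2+ documents.

    Assumes `field` is an exact-match (keyword) field; names are hypothetical.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "duplicate_urls": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,   # keep buckets with at least 2 docs
                    "size": max_buckets,  # upper bound on returned buckets
                }
            }
        },
    }


def extract_duplicates(response: dict) -> dict:
    """Map each duplicated URL to its document count from a search response."""
    buckets = response["aggregations"]["duplicate_urls"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


if __name__ == "__main__":
    # Body that would be sent to the index's _search endpoint.
    print(json.dumps(duplicate_url_query(), indent=2))
```

On an index of this size the aggregation would likely have to run in several passes, since each request returns at most `max_buckets` URLs.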

We will be indexing 2-3 million documents every 24 hours, and the index will hold around 100 million documents in total.


(Alexander Reelsen) #2

If you have a good hashing algorithm on the client side (fast, with as few collisions as possible), you could hash the URL and use that as the document ID; then you would not need any lookup mechanism on the Elasticsearch side. Isn't that an option?
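
A minimal sketch of the idea in Python, using only the standard library. Since the ID has to be a UUID in this case, one possibility (an assumption on my side, it depends on whether a name-based UUID is acceptable to the rest of the application) is a version 5 UUID derived from the URL, which is a SHA-1 hash formatted as a UUID. The function names `url_to_uuid` and `url_to_hash_id` are made up for this example.

```python
import hashlib
import uuid


def url_to_uuid(url: str) -> str:
    """Derive a deterministic, name-based (version 5) UUID from a URL.

    The same URL always yields the same UUID, so it can serve as the
    document ID without any lookup.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


def url_to_hash_id(url: str) -> str:
    """Alternative: a plain SHA-256 hex digest, if a UUID is not required."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    url = "https://example.com/some/page"
    print(url_to_uuid(url))     # same value on every run for this URL
    print(url_to_hash_id(url))  # 64 hex characters
```

Indexing a page under this ID a second time replaces the existing document instead of creating a duplicate. Note that the URL is hashed exactly as given, so variants such as a trailing slash or different letter case produce different IDs unless the URL is normalized first.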

(Matt) #3

@spinscale Thanks for the suggestion - will explore this option as well.
