Handling unique field (Other than the ID)

mattkallo · August 7, 2017, 2:33am

Hi All

I need help or suggestions on how to handle this.

From what I understand from the documentation and previous forum posts, ES supports only the ID field as a unique-value field. Please correct me if I am wrong.

We are trying to index web pages and want to avoid indexing the same URL twice. This could be done by making the URL the ID of the document; however, because of constraints on how the ID is used in the rest of our application, we cannot use the URL as the ID. We need a UUID as the ID of the document.

We are considering the options below. Which one would be better? Or are they all bad ideas, and is there another way?

1. Keep the URL as a separate field, and decide whether to insert or update by doing a lookup on the URL each time a new document is indexed.
2. Maintain another index with just the URL-to-UUID mapping, and look up the UUID for the incoming URL each time a new document comes in.
3. Have a batch job that looks for duplicates (using an aggregation). In this case we will have duplicate documents until the batch job kicks in.
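
For illustration, the request bodies behind options 1 and 3 might look roughly like the sketch below. It only builds the JSON bodies as Python dicts and does not talk to a cluster. The field name `url`, the aggregation name `duplicate_urls`, and both function names are assumptions made for this example, and the `url` field is assumed to be mapped as an exact-match (keyword) field.

```python
def url_lookup_query(url: str, field: str = "url") -> dict:
    """Option 1: search body that looks for an existing document with this exact URL."""
    # A term query matches the exact value, so the field must not be analyzed.
    return {"size": 1, "query": {"term": {field: url}}}


def duplicate_url_query(field: str = "url", max_buckets: int = 1000) -> dict:
    """Option 3: search body that lists URL values occurring in two or more documents."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "duplicate_urls": {
                # min_doc_count=2 keeps only values shared by at least two documents
                "terms": {"field": field, "min_doc_count": 2, "size": max_buckets}
            }
        },
    }
```

Note that a lookup followed by an insert is not atomic, and newly indexed documents only become searchable after a refresh, so two writers handling the same URL at nearly the same time could both miss each other's document. A terms aggregation over a field with roughly 100 million distinct values is also likely to be expensive.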

We will be indexing 2-3 million documents every 24 hours, and the index will have around 100 million documents in total.

Thanks

spinscale · August 7, 2017, 11:22am

If you have a good hashing algorithm on the client side (fast, with as few collisions as possible), you could hash the URL and use that as the key of the document. You would then not need any lookup mechanism on the Elasticsearch side. Isn't that an option?
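
The hashing suggestion above can be sketched as follows. Since the original question asks for a UUID as the document ID, this example uses a name-based UUID (version 5, which is SHA-1 based) from Python's standard `uuid` module, so the same URL always yields the same UUID. This is one possible choice, not something proposed in the thread. The function name `url_to_doc_id` and the whitespace trimming are assumptions for illustration; real crawlers usually need more careful URL canonicalization.

```python
import uuid


def url_to_doc_id(url: str) -> str:
    """Derive a deterministic UUID (version 5) from a URL, for use as the document ID."""
    # Assumption: trimming whitespace is the only normalization applied here.
    normalized = url.strip()
    # uuid5 hashes the namespace plus the name, so equal URLs give equal UUIDs.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalized))
```

Because indexing a document under an ID that already exists replaces the existing document, re-crawling the same URL would update it rather than create a duplicate.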

mattkallo · August 9, 2017, 4:35am

@spinscale Thanks for the suggestion - will explore this option as well.
