Removing dup hits

es_learner · September 21, 2012, 12:31am

Hello,

I just joined this group today to learn more about ES. We have been using ES for many months and it has been great for our application: making users' comments (including tweets) searchable.

We are running 0.19.2, but for the next version of our application we plan to move to 0.19.9.

Here's my problem:
a) We decided to split our ES indices into 'fast' and 'archive' clusters. 'fast' holds the latest 'n' comments, which are moved to the 'archive' cluster according to an LRU policy.
b) An archived comment gets recreated in the fast cluster as a new comment when someone replies to it or modifies it. This introduces a duplicate hit when we search across the 2 clusters.

Possible solutions:

1. Delete those comments from the archive when they are recreated in the fast cluster. This ensures each comment doc is unique across the 2 clusters. Cons: extra load on the archive cluster (search and delete).
2. Post-process the hits and remove dups (this is our current implementation). Cons: we can 'lose' 50% of the total hits unless we replenish with another query (with a cursor), but then when do we stop? This is also client-side deduping.
3. Get ES to do the deduping on the server side.
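
For the client-side post-processing option above, a minimal Python sketch of the dedup and replenish logic might look like the following. It is not ES client code: hits are assumed to be plain dicts carrying an `_id` key, and `fetch_page`, `wanted`, `page_size` and `max_pages` are hypothetical names standing in for whatever paging call and limits the application uses. The "when do we stop" question is handled here by three assumed stop conditions: enough unique hits, a short page, or a page cap.

```python
from typing import Callable, Dict, Iterable, List

Hit = Dict[str, object]  # assumed shape: a dict with at least an "_id" key


def dedupe_hits(fast_hits: Iterable[Hit], archive_hits: Iterable[Hit]) -> List[Hit]:
    """Merge hits from both clusters, keeping the first hit seen per _id.

    Fast-cluster hits are iterated first, so the recreated (newer) copy
    wins over the archived one. Note that this simple concatenation does
    not re-sort by score.
    """
    seen = set()
    merged = []
    for hit in list(fast_hits) + list(archive_hits):
        doc_id = hit["_id"]
        if doc_id in seen:
            continue  # duplicate of a comment we already kept
        seen.add(doc_id)
        merged.append(hit)
    return merged


def collect_unique(
    fetch_page: Callable[[int, int], List[Hit]],  # hypothetical: (offset, size) -> hits
    wanted: int,
    page_size: int = 10,
    max_pages: int = 5,
) -> List[Hit]:
    """Keep fetching pages until `wanted` unique hits are collected.

    Stops early when a page comes back short (no more results) or when
    `max_pages` pages have been fetched, so the loop always terminates.
    """
    seen = set()
    unique = []
    for page in range(max_pages):
        hits = fetch_page(page * page_size, page_size)
        for hit in hits:
            if hit["_id"] not in seen:
                seen.add(hit["_id"])
                unique.append(hit)
                if len(unique) == wanted:
                    return unique
        if len(hits) < page_size:
            break  # the source has no further results
    return unique
```

The page cap bounds the extra query load when a result set is dominated by duplicates, at the cost of sometimes returning fewer than `wanted` hits.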

Questions:
a) Any way to get ES to do the deduping? Based on _id field?
b) Any other suggestions?

Thanks.

sujoysett · September 21, 2012, 12:40pm

Hi,

Writing a river might meet your requirement of doing things on the server
side.

In an ES application I recently worked on, the requirement was to apply
some additional analysis to an index and replicate the docs (with the
additional info) to a sister index. The primary was a growing index,
so a custom river was developed that checks the difference between primary
and secondary at a regular interval and processes any data that is
in the primary but not in the secondary. It helped a lot in moving
processing load from the client side onto the ES server.
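
The periodic diff described above can be sketched roughly as follows. This is only an illustration of the idea in Python, not river code: the two indices are modelled as plain dicts keyed by doc id, and `sync_once` and `enrich` are hypothetical names for one pass of the check and for the additional analysis step.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def sync_once(
    primary: Dict[str, Doc],
    secondary: Dict[str, Doc],
    enrich: Callable[[Doc], Doc],  # hypothetical: the extra analysis step
) -> List[str]:
    """Copy every doc that is in primary but not in secondary.

    Returns the ids that were processed in this pass. A scheduler would
    call this once per interval.
    """
    missing = [doc_id for doc_id in primary if doc_id not in secondary]
    for doc_id in missing:
        secondary[doc_id] = enrich(primary[doc_id])
    return missing
```

Calling `sync_once` a second time without new docs in the primary does nothing, which is what makes it safe to run on a timer.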

Sujoy.

es_learner · September 21, 2012, 4:30pm

Thanks Sujoy. I'm not familiar with ES rivers but will look into that.

I'm still open to other suggestions.
