Removing dup hits


(es_learner) #1

Hello,

I just joined this group today to learn more about ES. We have been using ES for many months and it has been great for our application - making users' comments (including tweets) searchable.

We are running 0.19.2, but we plan to move to 0.19.9 for the next version of our application.

Here's my problem:
a) We decided to split our ES indices into 'fast' and 'archive' clusters. 'fast' holds the latest 'n' comments, which get moved to the 'archive' cluster based on an LRU policy.
b) An archived comment gets recreated in the fast cluster as a new comment when someone replies to it or modifies it. This introduces a duplicate hit when we search across the 2 clusters.

Possible solutions:

  1. Delete those comments from the archive when they are recreated in the fast cluster. This ensures each comment doc is unique across the 2 clusters. Cons: extra load on the archive cluster (search and delete).
  2. Post-process the hits and remove dups (this is our current implementation). Cons: we can 'lose' up to 50% of the total hits unless we replenish with another query (with a cursor), and it is not clear when to stop. Also, this is client-side deduping.
  3. Get ES to do the deduping on the server side.
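
To make option 2 concrete, here is a minimal Python sketch of client-side deduping with replenishment. It is not our actual implementation: `fetch_page` is a hypothetical callback standing in for whatever issues the paged search against both clusters, and `wanted`, `page_size` and `max_pages` are assumed parameter names. One possible answer to "when do we stop" is: when enough unique hits are collected, when a page comes back empty, or when a page cap is reached.

```python
from typing import Callable, Dict, List

Hit = Dict[str, object]


def dedupe_hits(fast_hits: List[Hit], archive_hits: List[Hit]) -> List[Hit]:
    """Merge hits from both clusters, keeping one hit per _id.

    Fast-cluster hits are listed first, so the fast copy wins on a clash.
    """
    seen = set()
    merged = []
    for hit in fast_hits + archive_hits:
        doc_id = hit["_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(hit)
    return merged


def collect_unique(
    fetch_page: Callable[[int, int], List[Hit]],
    wanted: int,
    page_size: int = 10,
    max_pages: int = 5,
) -> List[Hit]:
    """Keep fetching pages until `wanted` unique hits are collected.

    `fetch_page(offset, size)` is a hypothetical callback returning one page
    of hits. Stops when enough unique hits are found, a page is empty, or
    `max_pages` pages have been read.
    """
    if wanted <= 0:
        return []
    seen = set()
    unique = []
    for page in range(max_pages):
        hits = fetch_page(page * page_size, page_size)
        if not hits:
            break  # no more results to replenish from
        for hit in hits:
            if hit["_id"] in seen:
                continue  # duplicate of a hit we already kept
            seen.add(hit["_id"])
            unique.append(hit)
            if len(unique) == wanted:
                return unique
    return unique
```

The page cap bounds the extra queries, at the cost of sometimes returning fewer than `wanted` hits.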

Questions:
a) Is there any way to get ES to do the deduping, e.g. based on the _id field?
b) Any other suggestions?

Thanks.


(sujoysett) #2

Hi,

Writing a river might solve your requirement of doing things on the server
side.

In an ES application I recently worked with, the requirement was to apply
some additional analysis on an index and replicate the docs (with
additional info) on a sister index. The primary was a growing index,
and a custom river was developed that checks the difference between primary
and secondary at a fixed interval, and processes any data which is
in primary and not in secondary. It helped a lot in moving the processing
load from the client side to the ES server.
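
The core of that periodic check can be sketched in a few lines of Python. This is only an illustration of the diff-and-process step, not the river API itself: the two dicts stand in for the primary and secondary indices, and `process` is a hypothetical hook for the additional analysis.

```python
from typing import Callable, Dict, Set


def missing_ids(primary: Dict[str, str], secondary: Dict[str, str]) -> Set[str]:
    """Return ids that exist in primary but not yet in secondary."""
    return set(primary) - set(secondary)


def sync_once(
    primary: Dict[str, str],
    secondary: Dict[str, str],
    process: Callable[[str], str],
) -> int:
    """Run one sync pass and return how many docs were copied.

    Each doc missing from secondary is passed through `process`
    (a stand-in for the extra analysis) and stored under the same id.
    """
    todo = sorted(missing_ids(primary, secondary))
    for doc_id in todo:
        secondary[doc_id] = process(primary[doc_id])
    return len(todo)
```

A scheduler would call `sync_once` at the chosen interval; a pass that finds nothing missing copies nothing.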

Sujoy.


(es_learner) #3

Thanks Sujoy. I'm not familiar with ES rivers but will look into that.

I'm still open to other suggestions :)

