Duplicate results in resultset


(David-3) #1

I have an index with denormalized data, and searches against it return
duplicated data. There seems to be no way in an Elasticsearch query
to make the search results distinct. Is there a way to
remove the duplicates using the Elasticsearch API?

Thanks,
David


(Shay Banon) #2

What constitutes a duplicate? In general, though, no: duplicates between docs
in a single search request can't be filtered out, at least not easily.

On Tue, Jun 19, 2012 at 5:08 PM, David davidrockett@gmail.com wrote:

I have the following issue, I have an index with denormalized data.. the
search returns the duplicated data and it seems there is no way in an
elasticsearch query
to distinct the search results to remove the duplicates. Is there a way to
remove the duplicates using elasticsearch API?

Thanks,
David


(Daniel Schnell) #3

What you could do, depending on your data:

1.) Add a field to the index holding a hash value (e.g. MD5/SHA1) computed from the indexed fields, and filter out duplicate entries at query time, after you get the results back from ES.
2.) Don't add a hash field to the index, but calculate the hash for each returned doc at query time. This could hurt performance badly, depending on how many hits you want to show and how big the docs are.
3.) Similar to 1.), add the hash field, but post-process your data after importing it into ES: iterate over all ids and remove duplicate entries by running a more like this query on the hash field. For this you probably need to set the ids at insertion time and keep an external reference to them somewhere else, e.g. the SQL DB you imported the data from.
4.) Similar to 1.), add the hash field and check for duplicates with the more like this query before inserting new docs. This makes the import very slow, but probably safe with respect to duplicates.
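
A rough sketch of option 1.) in Python, in case it helps. It only shows the hashing and the client-side filtering, not the calls to ES itself. The names content_hash and dedupe_hits, the field name "content_hash" and the sample documents are made up for illustration; the only thing assumed about ES is that each hit carries its document under "_source", as in the "hits.hits" array of a search response.

```python
import hashlib
import json


def content_hash(doc, fields):
    """SHA-1 over the selected fields, serialized deterministically."""
    subset = {name: doc.get(name) for name in fields}
    payload = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def dedupe_hits(hits, hash_field="content_hash"):
    """Keep the first hit per hash value, preserving the ranking order."""
    seen = set()
    unique = []
    for hit in hits:
        key = hit["_source"][hash_field]
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# Index time: store the hash alongside the document (hypothetical fields).
doc = {"name": "Widget", "city": "Berlin", "order_id": 17}
doc["content_hash"] = content_hash(doc, ["name", "city"])

# Query time: two hits that differ only in a field that is not hashed.
hits = [
    {"_id": "1", "_source": doc},
    {"_id": "2", "_source": {**doc, "order_id": 18}},
]
print([hit["_id"] for hit in dedupe_hits(hits)])
```

Which fields go into the hash decides what counts as a duplicate, so that list has to match your data. Also keep in mind that filtering after the search can leave a page with fewer hits than you asked for, so you may want to request more hits than you display.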

bye,
Daniel.


(akbari) #4

Daniel Schnell, can you write an example, please? I have the same problem.

