I have the following issue: I have an index with denormalized data. The
search returns duplicated data, and there seems to be no way in an
Elasticsearch query to make the search results distinct and remove the
duplicates. Is there a way to remove the duplicates using the
Elasticsearch API?
1.) Add a field to the index with a unique hash value (e.g. MD5/SHA1) computed from the indexed fields, and filter out double entries at query time after you get the results from ES (see the first sketch after this list).
2.) Don't add a hash field to the index, but calculate the hash for each returned doc at query time. This could hurt performance badly, depending on how many hits you want to show and how big the docs are (also covered in the first sketch).
3.) Similar to 1.): add the hash field, but post-process your data after importing it into ES and remove all double entries by iterating over all ids and using a more-like-this query on the hash field. For this you probably need to set the ids at insertion time and keep an external reference to them somewhere else, e.g. in the SQL DB you imported the data from (see the second sketch after this list).
4.) Similar to 1.): add the hash field and check for duplicates with the more-like-this query before inserting new docs. This gives a very slow but probably safe import with respect to duplicates (also in the second sketch).
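
For 1.) and 2.), here is a minimal sketch of what this could look like against the REST API, in Python with the requests library. The index name "myindex", type "doc", the fields "title"/"body" and the field name "content_hash" are placeholders I made up, not anything from your setup:

import hashlib
import requests

ES = "http://localhost:9200"

def content_hash(doc, fields=("title", "body")):
    # Stable MD5 over the denormalized fields (option 1.)
    raw = "\x00".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def index_doc(doc_id, doc):
    doc["content_hash"] = content_hash(doc)  # store the hash at index time
    requests.put("%s/myindex/doc/%s" % (ES, doc_id), json=doc)

def search_deduped(query_string):
    body = {"query": {"query_string": {"query": query_string}}}
    resp = requests.post("%s/myindex/doc/_search" % ES, json=body)
    seen, unique = set(), []
    for hit in resp.json()["hits"]["hits"]:
        # Option 1.) reads the stored hash; option 2.) would compute
        # content_hash(hit["_source"]) here instead, paying the cost per hit.
        h = hit["_source"].get("content_hash") or content_hash(hit["_source"])
        if h not in seen:
            seen.add(h)
            unique.append(hit)
    return unique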
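
And for 3.) and 4.), a rough sketch along the same lines, reusing ES and content_hash() from the previous snippet. One substitution on my part: since the hash is an exact value, the lookup below uses a term query instead of the more-like-this query mentioned above; both should find the duplicates, term is just the exact-match variant:

import requests

ES = "http://localhost:9200"  # same placeholders as in the first sketch

def exists_by_hash(h):
    # Option 4.): look up the hash before inserting a new doc.
    body = {"query": {"term": {"content_hash": h}}, "size": 1}
    resp = requests.post("%s/myindex/doc/_search" % ES, json=body)
    return len(resp.json()["hits"]["hits"]) > 0

def insert_if_new(doc_id, doc):
    h = content_hash(doc)   # helper from the first sketch
    if exists_by_hash(h):
        return False        # duplicate: skip (slow but safe import)
    doc["content_hash"] = h
    requests.put("%s/myindex/doc/%s" % (ES, doc_id), json=doc)
    return True

def dedupe_index():
    # Option 3.): one-off cleanup pass after the import. Fetches the docs,
    # keeps the first id per hash, deletes the rest. A large index would
    # need to page through the results (e.g. with the scroll API) instead
    # of one big request.
    body = {"query": {"match_all": {}}, "size": 10000}
    resp = requests.post("%s/myindex/doc/_search" % ES, json=body)
    first_seen = {}
    for hit in resp.json()["hits"]["hits"]:
        h = hit["_source"]["content_hash"]
        if h in first_seen:
            requests.delete("%s/myindex/doc/%s" % (ES, hit["_id"]))
        else:
            first_seen[h] = hit["_id"]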
bye,
Daniel.
On 19.06.2012, at 17:08, David wrote:
I have the following issue: I have an index with denormalized data. The search returns duplicated data, and there seems to be no way in an Elasticsearch query
to make the search results distinct and remove the duplicates. Is there a way to remove the duplicates using the Elasticsearch API?