How to filter out duplicate documents across multiple types?


(anand) #1

We have certain documents stored across multiple types with translated
values. For example, the US and ES types hold the same document but with
different values in the title field.
Example:
US:
{
"title":"Manning: Spring in Action, Third Edition"
}

ES:
{
"title":"Manning : Primavera en Acción , Tercera Edición"
}

So, when I search for "Manning" across all types, I only want one document.

I can certainly remove the duplicates in my code, but then I cannot use
pagination.

Does anyone know how to remove the duplicates?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/109a396f-8032-4c03-be4b-b02e004507a2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(vineeth mohan-2) #2

Hello Anand,

I don't see any direct way to do this from the query.

The approach I have in mind goes like this:

  1. Identify duplicates while indexing and mark each duplicate document as
    a duplicate. A field named "isDuplicate" : "true/false" would be the best.
  2. While searching, filter out all duplicates.
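
The two steps above could be sketched roughly as follows. This is only an illustration: the `book_id` field used to recognise translations of the same document is hypothetical (nothing in the thread says how duplicates are identified), and the query body uses the `bool`/`filter` form from later Elasticsearch releases, whereas 1.x releases would express the same thing with a `filtered` query.

```python
def mark_duplicates(docs):
    """Step 1: flag every document after the first one seen per book_id.

    `book_id` is a hypothetical shared key; the thread does not name one.
    """
    seen = set()
    marked = []
    for doc in docs:
        flagged = dict(doc)  # copy so the input documents are left untouched
        flagged["isDuplicate"] = doc["book_id"] in seen
        seen.add(doc["book_id"])
        marked.append(flagged)
    return marked


def build_search_body(text):
    """Step 2: match on title, keeping only documents not flagged as duplicates."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"isDuplicate": False}}],
            }
        }
    }


docs = [
    {"book_id": "spring-3e", "locale": "US",
     "title": "Manning: Spring in Action, Third Edition"},
    {"book_id": "spring-3e", "locale": "ES",
     "title": "Manning : Primavera en Acción , Tercera Edición"},
]
marked = mark_duplicates(docs)
body = build_search_body("Manning")
```

Because the filter runs inside the query, hit counts and from/size pagination would be computed on the already de-duplicated result set.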

If the type name is not very important to you, I would advise storing the
type name as a separate field and storing all documents in the same type.
This way, you can make the indexing of duplicate elements atomic using
upserts -
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_upserts.html
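
One possible reading of this suggestion is a single document per book, addressed by a shared ID, with the locale kept as a field inside it. Each translation is then written with an Update API request that sets `doc_as_upsert`, so the first writer creates the document and later writers merge into it. The sketch below only builds the request body; the `titles` layout and the locale keys are assumptions for illustration, not something stated in the thread.

```python
import json


def build_upsert_body(locale, title):
    """Build an Update API body that creates the document or merges into it.

    The `titles.<locale>` layout is an assumed mapping for illustration.
    """
    return {
        "doc": {"titles": {locale: title}},
        # With doc_as_upsert, the partial doc is indexed as-is when no
        # document exists yet for the target ID.
        "doc_as_upsert": True,
    }


us_body = build_upsert_body("US", "Manning: Spring in Action, Third Edition")
es_body = build_upsert_body("ES", "Manning : Primavera en Acción , Tercera Edición")
print(json.dumps(es_body, ensure_ascii=False, indent=2))
```

Both bodies would be sent to the same document ID. Since the Update API merges the partial `doc` into the stored document, the second request should add its title next to the first instead of creating a second document.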

Thanks
Vineeth


(anand) #3

Thanks Vineeth,
I want to search across the types, i.e. US and ES, but return only
one document, not two.
The problem with the approach you suggested is that the search is then
limited to documents with isDuplicate=true/false.
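
To make the concern concrete, here is a small in-memory illustration (plain Python, no Elasticsearch involved) under one reading of it: if the ES copy carries the duplicate flag and is filtered out, a query that matches only the Spanish title returns nothing. The substring matching and the choice of the US copy as the unflagged one are assumptions made for the example.

```python
docs = [
    {"locale": "US", "isDuplicate": False,
     "title": "Manning: Spring in Action, Third Edition"},
    {"locale": "ES", "isDuplicate": True,
     "title": "Manning : Primavera en Acción , Tercera Edición"},
]


def search(text, docs):
    """Naive stand-in for a search that filters out flagged duplicates."""
    return [
        d for d in docs
        if not d["isDuplicate"] and text.lower() in d["title"].lower()
    ]


both = search("Manning", docs)       # both titles match, one survives the filter
spanish = search("Primavera", docs)  # only the flagged ES copy matches
print(len(both), len(spanish))       # prints: 1 0
```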


(system) #4