Finding duplicate documents, or their count, based on some field names


(narinder.izap) #1

Hi all,

I need to know whether Elasticsearch has a feature for finding duplicate
documents, or document counts, when I want to see how many documents have
the same values in two or more fields. I can do that for one field using
facets, but what if I need to do it against more than one field? For
example, suppose I have the following docs in ES:

doc 1:

{
  "name": "abc",
  "age": 22,
  "country": "usa",
  "gender": "male"
}

doc 2:

{
  "name": "xyz",
  "age": 27,
  "country": "usa",
  "gender": "male"
}

doc 3:

{
  "name": "xyz",
  "age": 22,
  "country": "india",
  "gender": "female"
}

doc 4:

{
  "name": "abc",
  "age": 22,
  "country": "usa",
  "gender": "female"
}

My requirement is to find all docs that have the same age and the same
country, so doc 1 and doc 4 are duplicates for me. In simple words, I want
something like a unique clause on a single field or on a composite key of
several fields. Is this possible?

Please let me know if it's possible using Elasticsearch, as it is a very
important feature for me.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1988ddb0-9bae-4263-b262-3c84f4445fa8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

If you can use 1.0.0.Beta2, aggregations might be a solution.

Demo: https://gist.github.com/jprante/8159379

Jörg
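
To make the aggregation idea concrete, here is a minimal sketch in Python. It is not taken from the gist: the helper names (build_duplicate_agg, duplicate_groups), the "by_<field>" aggregation names, and the sample response are all hypothetical. It assumes one terms aggregation nested per field and the bucket layout ("buckets", "key", "doc_count") used by released 1.x versions, which may differ in 1.0.0.Beta2.

```python
def build_duplicate_agg(fields, size=100):
    """Build a search body that nests one terms aggregation per field,
    outermost field first. Aggregation names are an arbitrary choice."""
    body = None
    for field in reversed(fields):
        agg = {"terms": {"field": field, "size": size}}
        if body is not None:
            agg["aggs"] = body  # attach the inner aggregation built so far
        body = {"by_" + field: agg}
    return {"size": 0, "aggs": body}  # size 0: buckets only, no hits


def duplicate_groups(aggs, fields, prefix=()):
    """Walk the nested buckets and return (key tuple, doc_count) for every
    innermost bucket that holds more than one document."""
    field, rest = fields[0], fields[1:]
    found = []
    for bucket in aggs["by_" + field]["buckets"]:
        key = prefix + (bucket["key"],)
        if rest:
            found.extend(duplicate_groups(bucket, rest, key))
        elif bucket["doc_count"] > 1:
            found.append((key, bucket["doc_count"]))
    return found


# Hand-written response for the four example docs (not real server output).
sample_response = {
    "aggregations": {
        "by_age": {
            "buckets": [
                {"key": 22, "doc_count": 3, "by_country": {"buckets": [
                    {"key": "usa", "doc_count": 2},
                    {"key": "india", "doc_count": 1},
                ]}},
                {"key": 27, "doc_count": 1, "by_country": {"buckets": [
                    {"key": "usa", "doc_count": 1},
                ]}},
            ]
        }
    }
}

request_body = build_duplicate_agg(["age", "country"])
print(duplicate_groups(sample_response["aggregations"], ["age", "country"]))
# [((22, 'usa'), 2)]
```

The request body would be sent to the index's _search endpoint. Filtering on doc_count is done on the client here, so the sketch does not depend on any server-side minimum-count option being available.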



(Ivan Brusic) #3

More Like This could work, especially if using non-analyzed fields:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-more-like-this.html

--
Ivan

On Sat, Dec 28, 2013 at 5:14 AM, joergprante@gmail.com wrote:

If you can use 1.0.0.Beta2, aggregations might be a solution.

Demo:

https://gist.github.com/jprante/8159379

Jörg



(Yann Barraud) #4

Hi,

You can check this: http://github.com/yannbrrd/elasticsearch-entity-resolution



(Alexander Reelsen) #5

Hey,

Another very simple solution could be a terms facet with a script field
that simply concatenates the two fields you want to check. See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_term_scripts

--Alex
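
The concatenation idea can be illustrated outside Elasticsearch with a few lines of Python. This is only a client-side sketch of what the script would compute per document, not the facet API itself; the composite_key helper and the "|" separator are assumptions made for the example.

```python
from collections import Counter

# The four example documents from the original question.
docs = [
    {"name": "abc", "age": 22, "country": "usa", "gender": "male"},
    {"name": "xyz", "age": 27, "country": "usa", "gender": "male"},
    {"name": "xyz", "age": 22, "country": "india", "gender": "female"},
    {"name": "abc", "age": 22, "country": "usa", "gender": "female"},
]


def composite_key(doc, fields, sep="|"):
    """Join the chosen field values into one string, as the script would."""
    return sep.join(str(doc[field]) for field in fields)


# Count documents per composite key, then keep keys seen more than once.
counts = Counter(composite_key(doc, ["age", "country"]) for doc in docs)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {'22|usa': 2}
```

The separator should be a character that cannot appear in the field values; otherwise two different value pairs could collapse into the same key.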

On Tue, Dec 31, 2013 at 1:57 PM, Yann Barraud <yann.barraud@gmail.com> wrote:

Hi,

You can check this :

http://github.com/yannbrrd/elasticsearch-entity-resolution


