Using elasticsearch to find duplicates in dataset


(Schmurfy) #1

Hello,
I am currently evaluating elasticsearch for a very specific task which is
removing duplicates from a contacts list, from my initial tests it looks
like it would works
but there are still some shadows I hope you can help me with.

What I was planning to do is:

  • load the data from some csv files
  • normalize the fields (phone numbers, addresses)
  • load the data into elasticsearch
  • run a bunch of queries on the data to find/remove/merge the duplicates
  • export the data back into csv

The first thing I am interested into is: do you think elasticsearch is a
good fit for this task ?
I chose it for its ability to quickly search within a dataset with a rather
large set of options and possibilities,
most of them I am sure I don't even know yet.

My last discoveries is facets and I did some tests to find which phone
numbers are present in more than one contacts with a query like this:

{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"fields" : ["phone_home", "phone_office", "phone_mobile",
"fax"],
"size" : 10
}
}
}
}

It works well but I have some questions I cannot find an answer to:

  • can I get all the results, I tried removing "size" but there is still
    some limit applied. I understand returning a big list is not what you
    want the default behavior to be but still is there a way to force it ?
  • is there a way to restrict the returned values, for example can I get all
    the terms which are present in at least 2 different records ?
    For my current use case I have no interest in terms present in only 1
    records but all I found was changing the terms facet ordering
    to reverse_count which effectively list me all the terms which are
    present once :confused:

Elasticsearch is a really nice project and I am sure I barely scratched the
surface of its possibilities but I am already really happy with it.

Thanks for any help on this.


Delete duplicate docs in ES 1.7
(Otis Gospodnetić) #2

Hi Julien,

We have done this with SolrCloud several months ago. In our experience
Solr worked well for this and I think ES would work just as well.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Sunday, July 22, 2012 8:29:04 AM UTC-4, Julien Ammous wrote:

Hello,
I am currently evaluating elasticsearch for a very specific task which is
removing duplicates from a contacts list, from my initial tests it looks
like it would works
but there are still some shadows I hope you can help me with.

What I was planning to do is:

  • load the data from some csv files
  • normalize the fields (phone numbers, addresses)
  • load the data into elasticsearch
  • run a bunch of queries on the data to find/remove/merge the duplicates
  • export the data back into csv

The first thing I am interested into is: do you think elasticsearch is a
good fit for this task ?
I chose it for its ability to quickly search within a dataset with a
rather large set of options and possibilities,
most of them I am sure I don't even know yet.

My last discoveries is facets and I did some tests to find which phone
numbers are present in more than one contacts with a query like this:

{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"fields" : ["phone_home", "phone_office", "phone_mobile",
"fax"],
"size" : 10
}
}
}
}

It works well but I have some questions I cannot find an answer to:

  • can I get all the results, I tried removing "size" but there is still
    some limit applied. I understand returning a big list is not what you
    want the default behavior to be but still is there a way to force it ?
  • is there a way to restrict the returned values, for example can I get
    all the terms which are present in at least 2 different records ?
    For my current use case I have no interest in terms present in only 1
    records but all I found was changing the terms facet ordering
    to reverse_count which effectively list me all the terms which are
    present once :confused:

Elasticsearch is a really nice project and I am sure I barely scratched
the surface of its possibilities but I am already really happy with it.

Thanks for any help on this.


(Jörg Prante) #3

Hi Julien,

On Sunday, July 22, 2012 2:29:04 PM UTC+2, Julien Ammous wrote:

Hello,
I am currently evaluating elasticsearch for a very specific task which is
removing duplicates from a contacts list, from my initial tests it looks
like it would works
but there are still some shadows I hope you can help me with.

What I was planning to do is:

  • load the data from some csv files
  • normalize the fields (phone numbers, addresses)
  • load the data into elasticsearch
  • run a bunch of queries on the data to find/remove/merge the duplicates
  • export the data back into csv

The first thing I am interested into is: do you think elasticsearch is a
good fit for this task ?

yes

I chose it for its ability to quickly search within a dataset with a
rather large set of options and possibilities,
most of them I am sure I don't even know yet.

My last discoveries is facets and I did some tests to find which phone
numbers are present in more than one contacts with a query like this:

{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"fields" : ["phone_home", "phone_office", "phone_mobile",
"fax"],
"size" : 10
}
}
}
}

It works well but I have some questions I cannot find an answer to:

  • can I get all the results, I tried removing "size" but there is still
    some limit applied. I understand returning a big list is not what you
    want the default behavior to be but still is there a way to force it ?

Result sizes are always "naturally" limited, just because of the limits of
JVM heap (you would have to move all docs into memory) and the enormous
amount of time and space you would have to spend on performing such a brute
force approach. This is not efficient.

Therefore, a cursor-like efficient approach is available. If you want to
scan though all documents of a result set, see the Scan Search API
http://www.elasticsearch.org/guide/reference/api/search/search-type.html

and the result set scrolling
http://www.elasticsearch.org/guide/reference/api/search/scroll.html

  • is there a way to restrict the returned values, for example can I get
    all the terms which are present in at least 2 different records ?
    For my current use case I have no interest in terms present in only 1
    records but all I found was changing the terms facet ordering
    to reverse_count which effectively list me all the terms which are
    present once :confused:

You mean something like SQL's "having" clause? See the Term stats Facet
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html

Best regards,

Jörg


(Schmurfy) #4

Thanks for pointing the scan search type I will look into that,
Elasticsearch is so different than the databases I know that I don't even
knew what to search for.

Yes what I am looking for is something similar to "HAVING COUNT(*) > 1" in
SQL, I saw the term_stats facet and thought it may be what I was looking
for but
I have no clue as to how it can be used, the documentation is rather thin
for this, could you show me an example to use as a starting point ?

On 23 July 2012 14:23, Jörg Prante joergprante@gmail.com wrote:

Hi Julien,

On Sunday, July 22, 2012 2:29:04 PM UTC+2, Julien Ammous wrote:

Hello,
I am currently evaluating elasticsearch for a very specific task which is
removing duplicates from a contacts list, from my initial tests it looks
like it would works
but there are still some shadows I hope you can help me with.

What I was planning to do is:

  • load the data from some csv files
  • normalize the fields (phone numbers, addresses)
  • load the data into elasticsearch
  • run a bunch of queries on the data to find/remove/merge the duplicates
  • export the data back into csv

The first thing I am interested into is: do you think elasticsearch is a
good fit for this task ?

yes

I chose it for its ability to quickly search within a dataset with a
rather large set of options and possibilities,
most of them I am sure I don't even know yet.

My last discoveries is facets and I did some tests to find which phone
numbers are present in more than one contacts with a query like this:

{
"query" : {
"match_all" : { }
},
"facets" : {
"tag" : {
"terms" : {
"fields" : ["phone_home", "phone_office", "phone_mobile",
"fax"],
"size" : 10
}
}
}
}

It works well but I have some questions I cannot find an answer to:

  • can I get all the results, I tried removing "size" but there is still
    some limit applied. I understand returning a big list is not what you
    want the default behavior to be but still is there a way to force it ?

Result sizes are always "naturally" limited, just because of the limits of
JVM heap (you would have to move all docs into memory) and the enormous
amount of time and space you would have to spend on performing such a brute
force approach. This is not efficient.

Therefore, a cursor-like efficient approach is available. If you want to
scan though all documents of a result set, see the Scan Search API
http://www.elasticsearch.org/guide/reference/api/search/search-type.html

and the result set scrolling
http://www.elasticsearch.org/guide/reference/api/search/scroll.html

  • is there a way to restrict the returned values, for example can I get
    all the terms which are present in at least 2 different records ?
    For my current use case I have no interest in terms present in only 1
    records but all I found was changing the terms facet ordering
    to reverse_count which effectively list me all the terms which are
    present once :confused:

You mean something like SQL's "having" clause? See the Term stats Facet

http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html

Best regards,

Jörg


(Jörg Prante) #5

Hi Julien,

I tried the terms stats facet but with little success.

Something like "having count" is only possible by doing it manually, by
iterating through the terms facet members and picking up the members with a
count matching the condition you want.

The term stats facet is only working with numeric fields (I don't know
exactly the reason) and is not suited well for your task. Sorry for
misleading. So I think the terms facet is best when used in the following
way:

"facets" : {
"tag_stats" : {
"terms" : {
"field" : "tag",
"size: 100,
"order" : "reverse_count"
}
}
}

The "reverse_count" will deliver all tags beginning with a single
occurence, then tags with two occurrences and so on. If you iterate and
want all tags with an occurence of 2 or more, you just skip the single
occurrence tags and take the rest of the members. Take care that the facet
member size is by default 10, it should be adjusted to your needs, mostly
by setting it much higher (in the example it is set to 100).

Maybe it is a good idea to open an issue for facet member limiting, e.g.
with additional parameters like "from_count", "to_count" or something.
Thoughts?

Best regards

Jörg

On Monday, July 23, 2012 4:16:53 PM UTC+2, Julien Ammous wrote:

Thanks for pointing the scan search type I will look into that,
Elasticsearch is so different than the databases I know that I don't even
knew what to search for.

Yes what I am looking for is something similar to "HAVING COUNT(*) > 1" in
SQL, I saw the term_stats facet and thought it may be what I was looking
for but
I have no clue as to how it can be used, the documentation is rather thin
for this, could you show me an example to use as a starting point ?


(David Pilato) #6

Maybe it is a good idea to open an issue for facet member limiting, e.g. with
additional parameters
like "from_count", "to_count" or something. Thoughts?

Sounds like a scroll on facet feature or a facet pagination feature. Nice idea.
It could answer to "give the 10 first facets for that term" and then "give me
the next 10 facets"...

But, I'm outside of this thread scope. :wink:

David.

Le 24 juillet 2012 à 10:05, "Jörg Prante" joergprante@gmail.com a écrit :

Hi Julien,

I tried the terms stats facet but with little success.

Something like "having count" is only possible by doing it manually, by
iterating through the terms facet members and picking up the members with a
count matching the condition you want.

The term stats facet is only working with numeric fields (I don't know
exactly the reason) and is not suited well for your task. Sorry for
misleading. So I think the terms facet is best when used in the following way:

"facets" : {
"tag_stats" : {
"terms" : {
"field" : "tag",
"size: 100,
"order" : "reverse_count"
}
}
}

The "reverse_count" will deliver all tags beginning with a single occurence,
then tags with two occurrences and so on. If you iterate and want all tags
with an occurence of 2 or more, you just skip the single occurrence tags and
take the rest of the members. Take care that the facet member size is by
default 10, it should be adjusted to your needs, mostly by setting it much
higher (in the example it is set to 100).

Maybe it is a good idea to open an issue for facet member limiting, e.g. with
additional parameters like "from_count", "to_count" or something. Thoughts?

Best regards

Jörg

On Monday, July 23, 2012 4:16:53 PM UTC+2, Julien Ammous wrote:

Thanks for pointing the scan search type I will look into that,
Elasticsearch is so different than the databases I know that I don't even
knew what to search for.

Yes what I am looking for is something similar to "HAVING COUNT(*) > 1"
in SQL, I saw the term_stats facet and thought it may be what I was looking
for but
I have no clue as to how it can be used, the documentation is rather thin
for this, could you show me an example to use as a starting point ?

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


(Schmurfy) #7

I noticed that scan search types are not available with faceted searchs too
bad.
Hopefully for my current usecase I will be able to request all records by
specifying a big enough number but it might still be a nice feature to have.

Thanks for your help.
Elasticsearch is really an interesting tool and the more I learn about it
the more I like it !

On 24 July 2012 10:55, David Pilato david@pilato.fr wrote:

**

Maybe it is a good idea to open an issue for facet member limiting,
e.g. with additional parameters
like "from_count", "to_count" or something. Thoughts?

Sounds like a scroll on facet feature or a facet pagination feature. Nice
idea.

It could answer to "give the 10 first facets for that term" and then "give
me the next 10 facets"...

But, I'm outside of this thread scope. :wink:

David.

Le 24 juillet 2012 à 10:05, "Jörg Prante" joergprante@gmail.com a
écrit :

Hi Julien,

I tried the terms stats facet but with little success.

Something like "having count" is only possible by doing it manually, by
iterating through the terms facet members and picking up the members with a
count matching the condition you want.

The term stats facet is only working with numeric fields (I don't know
exactly the reason) and is not suited well for your task. Sorry for
misleading. So I think the terms facet is best when used in the following
way:

"facets" : {
"tag_stats" : {
"terms" : {
"field" : "tag",
"size: 100,
"order" : "reverse_count"
}
}
}

The "reverse_count" will deliver all tags beginning with a single
occurence, then tags with two occurrences and so on. If you iterate and
want all tags with an occurence of 2 or more, you just skip the single
occurrence tags and take the rest of the members. Take care that the facet
member size is by default 10, it should be adjusted to your needs, mostly
by setting it much higher (in the example it is set to 100).

Maybe it is a good idea to open an issue for facet member limiting, e.g.
with additional parameters like "from_count", "to_count" or something.
Thoughts?

Best regards

Jörg

On Monday, July 23, 2012 4:16:53 PM UTC+2, Julien Ammous wrote:

Thanks for pointing the scan search type I will look into that,
Elasticsearch is so different than the databases I know that I don't even
knew what to search for.

Yes what I am looking for is something similar to "HAVING COUNT(*) > 1" in
SQL, I saw the term_stats facet and thought it may be what I was looking
for but
I have no clue as to how it can be used, the documentation is rather thin
for this, could you show me an example to use as a starting point ?

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


(system) #8