Entity/Identity resolution


(Yann Barraud) #1

Hi,

I've been working on search engines, data cleaning and related topics these
past few days. The challenge I'm facing right now is explaining that a search
engine (i.e. ElasticSearch, http://www.elasticsearch.org, in this case) cannot
be used for identity resolution on its own. Two Lucene wiki pages made that
easier to explain:
http://wiki.apache.org/lucene-java/ScoresAsPercentages and
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F
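
To illustrate why a raw relevance score is hard to read as a match probability, here is a minimal sketch. It uses a toy tf * idf sum, which is an assumption made for illustration and not Lucene's actual scoring formula; the corpus, the function names and the queries are all hypothetical.

```python
import math
from collections import Counter

# Toy corpus of person records (hypothetical data, illustration only).
DOCS = {
    "a": "john smith",
    "b": "john michael smith london",
    "c": "mary jones",
}


def idf(term):
    """Inverse document frequency of a term over the toy corpus."""
    df = sum(1 for text in DOCS.values() if term in text.split())
    return math.log(1 + len(DOCS) / (1 + df))


def raw_score(query, doc_id):
    """Sum of tf * idf over the query terms: unbounded and query-dependent."""
    tf = Counter(DOCS[doc_id].split())
    return sum(tf[term] * idf(term) for term in query.split())


# Every query term matches record "a" exactly.
exact_short = raw_score("john smith", "a")
# "paris" does not match record "b", yet the longer query scores higher.
partial_long = raw_score("john michael smith paris", "b")

print(exact_short, partial_long)
```

In this toy model the imperfect match on the longer query gets a higher raw score than the perfect match on the shorter one, so a fixed score threshold would not mean the same thing for both queries.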

I've also been playing with the Duke project (http://code.google.com/p/duke)
for batch data deduplication. It has been very powerful and covers my batch
requirements.

Now I'm wondering whether there is an opportunity to merge the two projects at
some point, to get a fast, live identity resolution service.

I'd say:

  1. Duke delegates data analysis & indexing to ElasticSearch
    (http://www.elasticsearch.org), as they both rely on Lucene indexes.
  2. Duke (http://code.google.com/p/duke) turns into an ES plugin that returns
    the records matching a query, with a Bayesian probability as output.

What do you guys think about it?

Regards,
Yann Barraud

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

+1

I developed JDBC/CSV/RDF plugins with the intent of growing them later
into a data cleaning / data transformation platform. There are many good
things in Duke. I am developing an ES-based engine for transforming
large amounts of bibliographic data. Bibliographic data is organized in
graphs, so I threw them over the fence into ES, and with simple term
queries I can build connected graphs, using the ES JSON sources for grouping
algorithms, which I would like to see evolve into JSON-LD. I also
developed domain-specific (German-language) add-ons and match keys
for processing unlinked bibliographic data (for matching authority data,
e.g. author names, for academic papers).

I haven't analyzed how the Duke code uses Lucene, but I think it is a
very cool idea to turn the Duke code into an ES app. Given the
presentation at http://de.slideshare.net/larsga/deduplication, there
could be much improvement in speed and versatility.

As for a generic instrument, I am not sure a single ES plugin can
cover all the needs of the users out there. I think something like a
simple dedup REST API hiding a river and a transformation step could be
a start, to help (non-Java) developers set up ES-based dedup
processing very quickly. By adding more specific dedup methods, maybe
several plugins could be feasible.

Jörg



(Yann Barraud) #3

Hi Lars,

In fact, my suggestion is a cross-post to the ES group.

I think a plugin would cover a simple service.

Let's say I have indexed all my data in ES with the right configuration
(analyzers, filters & co). The plugin could hook into the query and add a
Bayesian score to each and every match ES returns. In fact, this would
mean applying the classes/methods you use to calculate the score to the
record set ES returns...
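
The idea of hooking into the result set can be sketched as a post-processing step. This is a minimal sketch under stated assumptions: `rescore_hits` and `field_agreement` are hypothetical names, the hits are plain dictionaries rather than real ES responses, and the placeholder scorer stands in for whatever probability calculation Duke would supply.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def rescore_hits(query: Record, hits: List[Record],
                 probability: Callable[[Record, Record], float],
                 threshold: float = 0.8) -> List[dict]:
    """Attach a match probability to each hit and keep those above threshold."""
    scored = [{"record": hit, "probability": probability(query, hit)}
              for hit in hits]
    kept = [item for item in scored if item["probability"] >= threshold]
    return sorted(kept, key=lambda item: item["probability"], reverse=True)


def field_agreement(a: Record, b: Record) -> float:
    """Placeholder scorer: fraction of shared fields with equal values."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k].lower() == b[k].lower() for k in keys) / len(keys)


# Hypothetical candidates, as a search engine might return them.
hits = [
    {"name": "John Smith", "city": "London"},
    {"name": "John Smith", "city": "Leeds"},
    {"name": "Jon Smyth", "city": "Leeds"},
]
query = {"name": "john smith", "city": "london"}
result = rescore_hits(query, hits, field_agreement, threshold=0.5)

for item in result:
    print(item["probability"], item["record"])
```

Passing the scorer in as a function keeps the retrieval step and the probability step independent, which is roughly the split between ES and Duke discussed here.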

On Friday, April 12, 2013 at 10:24:46 UTC+2, Lars Marius Garshol wrote:

> Yann Barraud:
> Now I'm wondering if there is not an opportunity to merge at some point
> the two projects to get some fast live identity resolution service.

The ElasticSearch people will have to respond to what they think about
this. (I'm the maintainer of Duke, and have no real involvement with
ElasticSearch.)

However, it is actually possible to do live identity resolution with Duke
already. The original project that I developed Duke for uses Duke that way,
to maintain a database of links between duplicate records that's updated as
the source data is updated.

Basically, all you need for this is to implement a data source that
returns new records every time it's called. It's probably fairly easy to
extend the JDBC data source to handle that. (We use a different approach in
the original project.)
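
A data source that returns only new records on each call could look roughly like the sketch below. It is not Duke's data source interface: `IncrementalSource` and `fetch_new` are hypothetical names, an in-memory list stands in for a database table, and the sketch assumes record ids only ever increase.

```python
from typing import Dict, List


class IncrementalSource:
    """Hypothetical data source: each call to fetch_new() returns only
    records that no previous call has handed out."""

    def __init__(self, rows: List[Dict]):
        self._rows = rows      # stands in for a database table
        self._last_id = 0      # highest id already returned

    def fetch_new(self) -> List[Dict]:
        fresh = [row for row in self._rows if row["id"] > self._last_id]
        if fresh:
            self._last_id = max(row["id"] for row in fresh)
        return fresh


table = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
source = IncrementalSource(table)

first = source.fetch_new()             # both existing records
table.append({"id": 3, "name": "C"})   # the source data is updated
second = source.fetch_new()            # only the new record
third = source.fetch_new()             # nothing new since the last call

print(len(first), len(second), len(third))
```

With a real database the same idea would typically rely on an auto-incremented key or a modification timestamp to find rows added since the last call.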

> I'd say:
> • duke delegates data analysis & indexing to ElasticSearch (as they both
> rely on Lucene indexes)

I guess the goal here is to exploit ElasticSearch's ability to
incrementally retrieve only new data, and the fact that ES has already
indexed the data.

It is possible to fit this into Duke, basically by plugging in a new data
source and a new Database implementation. You can reuse bits of the
LuceneDatabase that's already there.

> • duke turns into an ES plugin to get records matching query with
> Bayesian probability as an output.

I'm not sure that's the right architecture, but it's possible to
implement. The Processor.match(Record, boolean) method essentially does
this already.

--
Lars Marius Garshol | Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no



(Yann Barraud) #4

Hi Jörg,

What I need at the moment is a simple service that can match an identity
against my data and return a match probability for it.

As far as I understand how Duke works, it first matches records with
Lucene, then calculates Bayesian probability based on the XML configuration.
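
The second stage, turning per-field comparisons into one probability, can be sketched as below. This is an illustration of naive Bayesian combination under stated assumptions, not code taken from Duke: the function names, the mapping from similarity to probability, and the low/high values in `CONFIG` (standing in for an XML configuration) are all hypothetical.

```python
def combine(p1: float, p2: float) -> float:
    """Naive Bayes combination of two probabilities assumed independent."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def property_probability(similarity: float, low: float, high: float) -> float:
    """Map a comparator similarity in [0, 1] to a probability.

    Assumed mapping: similar values pull the result towards `high`,
    dissimilar values yield `low`.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def match_probability(similarities: dict, config: dict) -> float:
    """Fold every property's probability into one overall probability."""
    prob = 0.5  # neutral starting point: no evidence either way
    for name, similarity in similarities.items():
        low, high = config[name]
        prob = combine(prob, property_probability(similarity, low, high))
    return prob


# Hypothetical per-property (low, high) probabilities.
CONFIG = {"name": (0.3, 0.9), "email": (0.2, 0.95)}

strong = match_probability({"name": 1.0, "email": 1.0}, CONFIG)
weak = match_probability({"name": 1.0, "email": 0.0}, CONFIG)

print(round(strong, 4), round(weak, 4))
```

Note that `combine` divides by zero if one input is exactly 1.0 and the other exactly 0.0, so a configuration would need to keep the low/high values strictly between 0 and 1.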

If I'm right, what I would implement is:

  • A new query type (ES plugin?) that computes a Bayesian probability for each
    matching record, based on a specific type that would hold the actual XML
    configuration Duke uses.

This would be an answer to the Lucene posts I mentioned earlier.

Lars has done a very good job with Duke. I think it is a very valuable
asset, and I regret not having the time right now to get my coding skills
back up to speed and help more. But if we manage, one way or another, to get
both projects working together, I think we could start building an incredibly
fast entity resolution service.



(Jörg Prante) #5

FYI, there is activity in the area of text classification (which is a
generalization of dedup) in the Lucene community, see

http://de.slideshare.net/teofili/text-categorization-with-lucene-and-solr

and a simple Bayesian classifier has recently been added to the Lucene
codebase

http://lucene.apache.org/core/4_2_0/classification/org/apache/lucene/classification/SimpleNaiveBayesClassifier.html

I think this can be exposed by ES API as well.

Jörg



(Yann Barraud) #6

Thanks! I'll look at it ASAP!



(Clinton Gormley) #7

On Fri, 2013-04-12 at 11:28 +0200, Jörg Prante wrote:

> FYI there is activity in the area of text classification (which is a
> generalization of dedup) in the Lucene community, see
>
> http://de.slideshare.net/teofili/text-categorization-with-lucene-and-solr
>
> and a simple bayesian classifier has been added recently to the Lucene
> codebase
>
> http://lucene.apache.org/core/4_2_0/classification/org/apache/lucene/classification/SimpleNaiveBayesClassifier.html
>
> I think this can be exposed by ES API as well.

Great presentations - looks exciting.



(Yann Barraud) #8

Indeed! I'm currently browsing the Duke code. It is quite simple and clear. I'd
like to sketch out a plugin, but I don't yet know where to start.

Are there any pointers you could give me?



(Jörg Prante) #9

I'm not very familiar with the Duke code internals.

The Lucene approach is something like this: first, throw a lot of
training documents into a Lucene index. Then call the "train" method in
SimpleNaiveBayesClassifier and give the classifier a name. In a second
step, you apply the SimpleNaiveBayesClassifier to a given document (ES
JSON) with "getAssignedClass". The JSON is then enriched with a field
where the found class is written, and the probability is given by
"getScore()", which could also be added to the given JSON. The JSON could
be handled like in the current "MoreLikeThis" ES action.
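
To make the two steps concrete, here is a minimal sketch in plain Python, assuming a toy multinomial naive Bayes rather than Lucene's actual classifier. The class name TinyNaiveBayes, the enrich helper and the field names "assigned_class" and "class_score" are made up for illustration; only the train-then-assign flow mirrors what is described above.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial naive Bayes; a stand-in, NOT Lucene's classifier."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocabulary = set()

    def train(self, text, label):
        """Step 1: feed one labelled training document."""
        tokens = text.lower().split()
        self.class_docs[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def assign_class(self, text):
        """Step 2: return (best_class, probability) for the given text."""
        tokens = text.lower().split()
        total_docs = sum(self.class_docs.values())
        log_scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Laplace smoothing so unseen terms do not zero the score
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalise the log scores into probabilities that sum to 1
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        best = max(exp, key=exp.get)
        return best, exp[best] / sum(exp.values())


def enrich(doc, classifier, text_field="title"):
    """Copy the JSON-like document and add the class and its score."""
    label, score = classifier.assign_class(doc[text_field])
    return {**doc, "assigned_class": label, "class_score": score}


clf = TinyNaiveBayes()
clf.train("lucene search index query", "search")
clf.train("elasticsearch index shard", "search")
clf.train("bayes probability prior", "statistics")
clf.train("probability distribution variance", "statistics")
doc = enrich({"title": "index query"}, clf)
```

The enriched document keeps its original fields, so it could be indexed or returned as usual with the two extra fields attached.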

I would try to figure out if the Duke approach is more versatile than
the simple Lucene approach, and if the Lucene approach could somehow be
extended to fit the Duke methods.

And you might consider what a "classify API" in ES HTTP REST could look
like... something like "_train" and "_assign" endpoints could be added...

Jörg

On 12.04.13 13:06, Yann Barraud wrote:

Indeed ! I'm currently browsing Duke code. It is quite simple & clear.
I wish I could scrapbook a plugin, but do not know yet where to start.

Any pointer you could give me ?



(Yann Barraud) #10

Hi Jörg,

In fact what I have in mind (which might be erroneous) is the following :

  • Enrich (extend) a request, by adding some matching configuration, more
    or less the way Duke is configured :
{
    "term" : { "user" : "kimchy" },

}



(Yann Barraud) #11

Hi Jörg,

In fact what I have in mind (which might be erroneous) is the following :

{
    "term" : { "user" : "kimchy" },
    "entity_resolution" : {
        "field" : {
            "name" : "user",
            "cleaner" : "asciifolding, toLowerCase",
            "comparator" : "levenshtein",
            "min" : "0.3",
            "max" : "0.8"
        }
    }
}

Then, for each match, apply this configuration by calling double
no.priv.garshol.duke.Processor.compare(Record, Record), where the returned
value is the probability that the matching value is the same as the
requested one.
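
As a rough illustration of what such a comparison could compute, here is a minimal Python sketch. It is not Duke's code: the linear mapping from similarity to a probability between "min" and "max" is an assumption made for this example, and compare is only a hypothetical stand-in for Processor.compare(Record, Record). The cleaning step mirrors the "asciifolding, toLowerCase" cleaners and the comparator is a plain Levenshtein distance.

```python
import unicodedata


def clean(value):
    """ASCII-fold and lowercase, mirroring "asciifolding, toLowerCase"."""
    folded = unicodedata.normalize("NFKD", value)
    stripped = "".join(c for c in folded if not unicodedata.combining(c))
    return stripped.lower().strip()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def field_probability(a, b, low, high):
    """ASSUMPTION: interpolate linearly between "min" and "max"."""
    return low + (high - low) * similarity(clean(a), clean(b))


def combine(p1, p2):
    """Naive Bayes combination of two independent probabilities."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))


def compare(record_a, record_b, fields):
    """Hypothetical stand-in for Processor.compare(Record, Record)."""
    prob = 0.5  # neutral prior: no evidence either way
    for name, (low, high) in fields.items():
        prob = combine(prob, field_probability(record_a[name],
                                               record_b[name], low, high))
    return prob


config = {"user": (0.3, 0.8)}  # field name -> (min, max) from the request
same = compare({"user": "Kimchy"}, {"user": "kimchy"}, config)
different = compare({"user": "kimchy"}, {"user": "xyzabc"}, config)
```

With a single configured field, identical values after cleaning end up at the "max" probability and unrelated values fall below the neutral 0.5; additional fields would push the combined value further in either direction.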

The response should be exactly what it is now, plus a field (called
_matching_probability ?) containing this value.

Do you guys think this is a good way to do this ?

I would then implement it as a plugin. But I have to find out how to (and
whether I have to) extend or create a new search request and then enrich
the Query DSL.

PS : spent a few hours awake last night thinking about it. I think this
should be either an option within Term Query or a new Query based on Term
Query.

Regards,

Yann



(Jörg Prante) #12

Yann, you should check out how FuzzyQuery works.

http://www.elasticsearch.org/guide/reference/query-dsl/fuzzy-query/

Jörg

On 14.04.13 13:00, Yann Barraud wrote:

PS : spent a few hours awake this night thinking about it. I think
this should be whether an option within Term Query or a new Query
based on Term Query.



(Yann Barraud) #13

Hi,

I looked at the FuzzyQuery. It is really interesting for what I have in
mind. But if I understood correctly, all queries are Lucene Queries, and
thus results are all processed the same way. Am I right ?

Does this mean I can not inject an additional treatment (Bayesian scoring)
into my results, even with a new Query plugin ?

Regards,
Yann Barraud

2013/4/14 Jörg Prante joergprante@gmail.com

Yann, you should check out how FuzzyQuery works.

http://www.elasticsearch.org/guide/reference/query-dsl/fuzzy-query/

http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

Jörg

On 14.04.13 13:00, Yann Barraud wrote:

PS : spent a few hours awake this night thinking about it. I think this
should be whether an option within Term Query or a new Query based on Term
Query.



(Yann Barraud) #14

Ok, digging & digging, I found out that implementing a CustomScoreQuery
with a native script is a more standard answer given the ES architecture.
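
For illustration only, a request body for this approach might look like the sketch below, built in Python so it can be printed as JSON. The custom_score query with "script", "lang" and "params" is part of the ES Query DSL of that era; the script name "entity_resolution" and every entry under "params" are hypothetical and would have to match whatever a plugin actually registers and reads.

```python
import json


def entity_resolution_query(field, value, script_name="entity_resolution"):
    """Build a custom_score body that delegates scoring to a native script."""
    return {
        "query": {
            "custom_score": {
                # Candidate selection stays an ordinary query.
                "query": {"fuzzy": {field: value}},
                # With lang "native", "script" is the registered script name.
                "script": script_name,  # hypothetical name
                "lang": "native",
                # Hypothetical parameters echoing the proposed configuration.
                "params": {
                    "field": field,
                    "value": value,
                    "comparator": "levenshtein",
                    "min": 0.3,
                    "max": 0.8,
                },
            }
        }
    }


body = entity_resolution_query("user", "kimchy")
print(json.dumps(body, indent=2))
```

The inner query only selects candidates; the native script then replaces the Lucene score of each candidate with whatever value it computes.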

Going this way.

Hope I won't get lost once more. :wink:

Regards,
Yann Barraud



(Yann Barraud) #15

Project finally on its way !



(Yann Barraud) #16

Project finally on its way !!

On Friday, 12 April 2013 14:54:04 UTC+2, Lars Marius Garshol wrote:

  • Yann Barraud

In fact my suggestion is a cross post to the ES group.

Yeah, I saw.

Let's say I indexed all my data within ES, with the right configuration
(analyzers, filters & co). The plugin could hook the query and add a
Bayesian score for each and every match ES returns. In fact, this would
imply using the class/methods you use to calculate the score on the records
set ES returns…

Yes, that's definitely possible. Basically, you could do this by using
Processor.compare(Record, Record), which is already there and public. You
could use that without having any data sources at all.
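
A minimal sketch of that idea, in Python and without any Duke or ES dependency: take a search response shaped like what ES returns, score every hit against the query record, and attach the value under the _matching_probability name floated earlier in the thread. The match_probability function is only a stand-in for Processor.compare (it averages difflib similarity ratios, which is not a Bayesian score), and the sample response is invented for the example.

```python
from difflib import SequenceMatcher


def match_probability(query_record, candidate, fields):
    """Stand-in for Processor.compare: mean string similarity over fields."""
    ratios = [
        SequenceMatcher(None,
                        str(query_record[f]).lower(),
                        str(candidate.get(f, "")).lower()).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


def add_matching_probability(response, query_record, fields):
    """Return a copy of the response with each hit scored and re-sorted."""
    hits = []
    for hit in response["hits"]["hits"]:
        prob = match_probability(query_record, hit["_source"], fields)
        hits.append({**hit, "_matching_probability": prob})
    # Most probable match first, regardless of the original Lucene score
    hits.sort(key=lambda h: h["_matching_probability"], reverse=True)
    return {**response, "hits": {**response["hits"], "hits": hits}}


# Invented sample response, shaped like an ES search result
response = {"hits": {"total": 2, "hits": [
    {"_id": "1", "_score": 1.2, "_source": {"user": "kimchi"}},
    {"_id": "2", "_score": 0.9, "_source": {"user": "kimchy"}},
]}}
enriched = add_matching_probability(response, {"user": "Kimchy"}, ["user"])
```

Nothing here needs a data source: the two records being compared are simply the query values and each hit's _source.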

I think if you want to take this further, probably the best thing to do is
to simply write the plugin. It doesn't sound like it should be hard at all.
I'd be very happy to add it to Duke as a contribution. Or it could go into
ElasticSearch, if they want it. Either way is fine with me.

--
Lars Marius Garshol | Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no



(system) #17