ANN : elasticsearch-entity-resolution plugin 0.1


(Yann Barraud) #1

Hi folks,

I'm happy to announce the first release of my Elasticsearch entity resolution
plugin.

Hope you enjoy it!

Regards,
Yann

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Zachary Tong) #2

Intriguing! Will play around with this, thanks for making it available!

So the score of each document is proportional to how "similar" the two
items are, based on the deduplication parameters provided? Very naive
question, I don't know anything about deduplication: how does this differ
from doing regular searches that find similar documents (like a fuzzy or
more-like-this query)?

-Zach

On Friday, September 13, 2013 12:55:02 PM UTC-4, Yann Barraud wrote:

Hi folks,

I'm happy to announce my first release of Elasticsearch entity resolution
plugin.

https://github.com/YannBrrd/elasticsearch-entity-resolution/

Hope you'll enjoy it !

Regards,
Yann



(Yann Barraud) #3

Well, this separates search from scoring. Basically, it is not only about
finding matching items; it is about knowing how similar one item is to
another. This is really different.

You might have a look at these pages:
http://wiki.apache.org/lucene-java/ScoresAsPercentages &
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F

Quote:

Can I filter by score?
Not safely. You can always pick an arbitrary score value and then
check the Hits object to see how many results have a score higher
than that value (a Binary search might come in handy) but it really
doesn't give you any meaningful information because of the way score
is calculated...

Scores As Percentages

People frequently want to compute a "Percentage" from Lucene scores
to determine what is a "100% perfect" match vs a "50%" match. This
is also sometimes called a "normalized score".

Don't do this.

Seriously. Stop trying to think about your problem this way, it's not
going to end well.

This project just fills in the gaps...
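
To make the distinction concrete, here is a minimal sketch. It is not the plugin's code: the normalized edit-distance measure and the names `levenshtein` and `similarity` are assumptions chosen only for illustration. The point is that a pairwise similarity can be bounded to [0, 1], so a threshold on it means the same thing for every pair of items, which a raw Lucene score does not give you.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical records, compared field by field in a real setup.
    print(similarity("Yann Barraud", "Yan Baraud"))
    print(similarity("Yann Barraud", "Zachary Tong"))
```

Because the value is bounded, a cutoff (say 0.8, an arbitrary figure here) is comparable across pairs, whereas a Lucene score depends on the query and on index statistics.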

Cordialement,
Yann Barraud



(Zachary Tong) #4

Gotcha, that makes sense. Thanks for the links. Will definitely spend
some time playing with this, I can see it being very useful.

Cheers,
-Zach



(Yann Barraud) #5

Good to hear that! :)

As you're an ES dev, I have a question for you: I'm looking for a code
example showing how to read data from an index while instantiating the plugin.
I'd also love to be able to store the conf within the index itself (type:
entity-conf, doc named in the JSON payload). Do you have any pointers? Any
docs?

Thanks!

Cordialement,
Yann Barraud



(Yann Barraud) #6

Self-reply:
https://github.com/imotov/elasticsearch-native-script-example/blob/master/src/main/java/org/elasticsearch/examples/nativescript/script/LookupScript.java
This should do the trick ;)

Cordialement,
Yann Barraud


