Apply synonyms that include confidence weights

Jake_M · March 3, 2014, 12:20am

I want to apply weights to different synonyms because in some cases we are
not sure the relationship is legitimate, or the relationship is not a true
synonym. For example, I would like 'doctor' and 'nurse' to be related such
that a search for 'doctor' will also return documents containing 'nurse'
but give them a lower score. What is the best way to achieve this
functionality?

I have found several examples of how to apply exact synonyms. These are
the approaches I've considered so far, but neither is exactly what I want:
A) Apply the synonym expansions at index time into separate synonym fields,
one per confidence level. This works, but tf-idf scoring for those fields is
off because they contain only synonyms. The result is that a search for
'doctor' actually returns the documents containing 'nurse' first, because the
synonym scores higher as the only word in its synonym field.
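
The effect described in (A) can be reproduced with a rough sketch. It assumes a simplified classic Lucene-style formula (square root of term frequency, times idf squared, times a length norm of 1/sqrt(field length)); the function name, sample text, and idf value are made up for illustration and are not taken from the actual index.

```python
import math

def classic_score(term, field_tokens, idf=1.0):
    """Simplified classic TF-IDF score for one term in one field.

    Assumed formula: sqrt(freq) * idf^2 * 1/sqrt(number of tokens in field).
    """
    freq = field_tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)
    length_norm = 1.0 / math.sqrt(len(field_tokens))
    return tf * idf * idf * length_norm

# Document 1 really mentions "doctor" inside a ten-word body field.
body_doc1 = "the doctor saw the patient early in the morning today".split()

# Document 2 only mentions "nurse"; index-time expansion wrote "doctor"
# into a dedicated synonym field that contains nothing else.
synonyms_doc2 = ["doctor"]

direct = classic_score("doctor", body_doc1)
via_synonym = classic_score("doctor", synonyms_doc2)

print(f"direct match in body:   {direct:.3f}")
print(f"match in synonym field: {via_synonym:.3f}")
```

Under these assumptions the one-word synonym field wins because of the length norm, not because the term occurs more often, which matches the ranking problem seen with this approach.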

B) Use the built-in synonyms support and apply synonym expansion or
contraction as shown here:

https://gist.github.com/clintongormley/4095280

We create an index with:

 * two filters: `synonyms_expand` and `synonyms_contract`
 * two analyzers: `synonyms_expand` and `synonyms_contract`
 * three text fields:
   * `text_1` uses the `synonyms_expand` analyzer at index and search time
   * `text_2` uses the `synonyms_expand` analyzer at index time, but the `standard` analyzer at search time
   * `text_3` uses the `synonyms_contract` analyzer at index and search time

This works great but treats the synonyms as perfectly equal.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/af5439ea-1d03-4717-89d6-e79f9672cf7e%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · March 3, 2014, 3:36pm

You can always disable term frequencies on a field to eliminate the tf-idf
issue, but then scoring would be affected, and that might be more detrimental
than the original problem.

The standard solution in Lucene is to use payloads, which are metadata
associated with a term in the index. The synonym filter would add a weight
payload for each term, and these weights would be read in the scorePayload
method of the Similarity class. The concept is simple, but it requires a
lot of boilerplate code for the analysis, query parsing and the
similarity.

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/payloads/PayloadTermQuery.html
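
To make the payload idea concrete, here is a pure-Python simulation of the concept. It does not use the Lucene API: the synonym table, the `analyze` and `payload_score` functions, and the choice to multiply a term-frequency score by the average payload are all assumptions made for illustration.

```python
import math

# Hypothetical weighted synonym table: term -> [(related term, confidence)].
WEIGHTED_SYNONYMS = {
    "doctor": [("nurse", 0.4)],
    "nurse": [("doctor", 0.4)],
}

def analyze(text):
    """Emit (term, payload) pairs.

    Original tokens carry payload 1.0; injected synonyms carry their
    confidence weight, which is what a payload-aware synonym filter would do.
    """
    tokens = []
    for word in text.lower().split():
        tokens.append((word, 1.0))
        for related, confidence in WEIGHTED_SYNONYMS.get(word, []):
            tokens.append((related, confidence))
    return tokens

def payload_score(query_term, tokens):
    """Score = sqrt(match count) * average payload of the matching tokens."""
    payloads = [p for term, p in tokens if term == query_term]
    if not payloads:
        return 0.0
    return math.sqrt(len(payloads)) * (sum(payloads) / len(payloads))

score_a = payload_score("doctor", analyze("the doctor arrived"))
score_b = payload_score("doctor", analyze("the nurse arrived"))

print(f"document with 'doctor': {score_a:.2f}")
print(f"document with 'nurse':  {score_b:.2f}")
```

Both documents match a search for 'doctor', but the one that only matched through the synonym is scaled down by its weight. In real Lucene the equivalent pieces would be a custom token filter, a payload-aware query, and a Similarity that reads the payload.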

Cheers,

Ivan

Jake_M · March 3, 2014, 7:27pm

I don't quite follow the steps I need to take to implement this. Do I need
to override the SynonymTokenFilter and Similarity classes to add weight
payloads? Or can this be achieved with scripts in ES?

Thanks for your help,
Jake

Ivan · March 3, 2014, 10:27pm

Yes, unfortunately you would need to write a lot of code and deploy it as a
plugin. That is why I mentioned that "it requires a lot of boilerplate code". I
don't see how else you can add weights to synonyms. Elasticsearch exposes
Lucene's DelimitedPayloadTokenFilter [1], but you want to configure weights
with synonyms and not just via the text analyzed.
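
The limitation with a delimited payload filter is that the weights have to travel inside the text being analyzed. A hedged sketch of what that would mean in practice: the application expands synonyms itself before indexing and writes the weight next to each token. The `|` delimiter, the function names, and the synonym table below are assumptions for illustration; the parsing function only imitates the filter's behavior and is not the filter itself.

```python
def expand_with_weights(text, synonyms, delimiter="|"):
    """Pre-expand synonyms in application code, appending a weight to every token."""
    out = []
    for word in text.lower().split():
        out.append(f"{word}{delimiter}1.0")
        for related, confidence in synonyms.get(word, []):
            out.append(f"{related}{delimiter}{confidence}")
    return " ".join(out)

def parse_delimited(text, delimiter="|"):
    """Imitate a delimited payload filter: split 'term|weight' into (term, payload)."""
    tokens = []
    for chunk in text.split():
        term, sep, raw = chunk.partition(delimiter)
        payload = float(raw) if sep else 1.0  # tokens without a weight default to 1.0
        tokens.append((term, payload))
    return tokens

# Hypothetical synonym table kept outside the search engine.
synonyms = {"nurse": [("doctor", 0.4)]}

field_text = expand_with_weights("nurse on duty", synonyms)
tokens = parse_delimited(field_text)

print(field_text)
print(tokens)
```

This keeps the weights out of the synonym filter entirely, at the cost of doing the expansion outside the analysis chain, which is the trade-off being pointed out here.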

The new function scoring [2] does expose payloads, but I have never used
it. Re-implementing TF-IDF in the function scoring can get tricky. IMHO, it
might be easier to implement your own Similarity and have it do the payload
scoring. But once again, I have never used payloads in Elasticsearch, only
directly in Lucene, so I could be wrong.

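[1] Elasticsearch reference, delimited payload token filter
(reference/current/analysis-delimited-payload-tokenfilter.html)

[2] Elasticsearch reference, advanced scripting, section on term positions,
offsets and payloads (reference/current/modules-advanced-scripting.html)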

Cheers,

Ivan

Jake_M · March 4, 2014, 12:52am

Got it. Thanks Ivan, this has been very helpful.

Best,
Jake

Ivan · March 4, 2014, 5:01pm

Hopefully you can find a way to make things work with less code. It would
be great if payloads were more of a first-class citizen in Elasticsearch,
but it is up to the Lucene layer to handle analysis. I really need to play
around with the "new" text scoring abilities.

--
Ivan
