On Sun, Mar 2, 2014 at 4:20 PM, Jake M wrote:
I want to apply weights to different synonyms because in some cases we are
not sure the relationship is legitimate, or the relationship is not a true
synonym. For example, I would like 'doctor' and 'nurse' to be related such
that a search for 'doctor' also returns documents containing 'nurse' but
gives them a lower score. What is the best way to achieve this?
I have found several examples of how to apply exact synonyms. These are the
approaches I've considered so far, but neither is exactly what I want:
A) Apply the synonym expansions at index time into separate synonym fields,
one per confidence level. This works, but tf-idf scoring for those fields is
off because they contain only synonyms. The result is that a search for
'doctor' actually returns the documents with 'nurse' first, because 'nurse'
has a higher term frequency, being the only word in the synonym field.
B) Use the built-in synonym support and apply synonym expansion or
contraction as shown here: Using synonyms in Elasticsearch · GitHub
This works great but treats the synonyms as perfectly equal.
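To make the ranking problem in approach A concrete, here is a rough sketch using a simplified form of Lucene's classic TF-IDF (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(field length)). The corpus numbers and the function name are made up for illustration, and query normalization, coord and the lossy norm encoding are ignored. Under these assumptions, most of the advantage of the synonym-only field comes from its length norm.

```python
import math


def classic_tfidf(term_freq, doc_freq, num_docs, field_length):
    """Simplified classic TF-IDF score for a single-term query.

    Assumption: this mirrors only the tf, idf and length-norm parts of
    Lucene's classic similarity; boosts and query normalization are left out.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf * idf * length_norm


# Hypothetical corpus statistics, chosen only for illustration.
NUM_DOCS = 1000
DOC_FREQ = 50

# 'doctor' appears once in a 100-term body field.
body_score = classic_tfidf(1, DOC_FREQ, NUM_DOCS, field_length=100)

# A synonym field that holds nothing but the single expanded term.
synonym_score = classic_tfidf(1, DOC_FREQ, NUM_DOCS, field_length=1)

print(f"body field match:    {body_score:.3f}")
print(f"synonym field match: {synonym_score:.3f}")
```

With everything else equal, the one-term field scores higher than the real match in the longer body field, which is the behaviour described in A.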
On Monday, March 3, 2014 7:36:26 AM UTC-8, Ivan Brusic wrote:
You can always disable term frequencies on a field to eliminate the tf-idf
issue, but then scoring would be affected, perhaps more detrimentally than
the original problem.
The standard solution in Lucene is to use payloads, which are metadata
associated with a term in the index. The synonym filter would add a weight
payload to each term, and these weights would be read in the scorePayload
method of the Similarity class. The concept is simple, but it requires a
lot of boilerplate code for the analysis, the query parsing and the
similarity.
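As a toy model of the payload idea (plain Python, not Lucene code), the sketch below attaches a weight to every indexed term occurrence and multiplies the score by the average weight. The synonym table, the function names and the choice of averaging are all assumptions made for illustration; in Lucene the equivalent logic would live in a custom token filter and in the Similarity.

```python
import math
from collections import defaultdict

# Hypothetical weighted synonym rules: a document containing the key is also
# indexed under each listed term, carrying the given weight as its payload.
WEIGHTED_SYNONYMS = {"nurse": [("doctor", 0.4)]}


def analyze(text):
    """Yield (term, payload) pairs; original tokens get payload 1.0."""
    for token in text.lower().split():
        yield token, 1.0
        for synonym, weight in WEIGHTED_SYNONYMS.get(token, []):
            yield synonym, weight


def build_index(docs):
    """Map term -> {doc_id: [payload, ...]}, one payload per occurrence."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for term, payload in analyze(text):
            index[term][doc_id].append(payload)
    return index


def search(index, term, num_docs):
    """Score = sqrt(tf) * idf * average payload, highest score first."""
    postings = index.get(term, {})
    idf = 1.0 + math.log(num_docs / (len(postings) + 1))
    results = []
    for doc_id, payloads in postings.items():
        tf = math.sqrt(len(payloads))
        payload_factor = sum(payloads) / len(payloads)
        results.append((doc_id, tf * idf * payload_factor))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = {
    "d1": "the doctor is in",
    "d2": "ask the nurse",
    "d3": "closed on sunday",
}
index = build_index(docs)
results = search(index, "doctor", len(docs))
for doc_id, doc_score in results:
    print(doc_id, round(doc_score, 3))
```

A search for 'doctor' still finds the 'nurse' document, but the 0.4 payload scales its score down relative to the literal match.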
I don't quite follow the steps I need to take to implement this. Do I need
to override the SynonymTokenFilter and Similarity classes to add weight
payloads? Or can this be achieved with scripts in ES?
Thanks for your help,
Jake
On Monday, March 3, 2014 2:27:08 PM UTC-8, Ivan Brusic wrote:
Yes, unfortunately you would need to write a lot of code and deploy it as a
plugin. That is why I mentioned "it requires a lot of boilerplate code". I
don't see how else you can add weights to synonyms. Elasticsearch exposes
Lucene's DelimitedPayloadTokenFilter, but you want to configure the weights
together with the synonyms, not just via the text being analyzed.
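To illustrate what "via the text analyzed" means, here is a sketch of the delimited payload route. It assumes the filter is registered as "delimited_payload_filter" with "delimiter" and "encoding" options, as I believe it was in Elasticsearch releases of that period (check the documentation for your version). The analyzer and filter names, the synonym table and the helper function are hypothetical. The point is that the weights have to be written into the field text by the client before indexing.

```python
import json

# Analysis settings sketch (assumed filter type and options, see above).
# A whitespace tokenizer is used so that "doctor|0.4" stays a single token.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "weight_payloads": {
                "type": "delimited_payload_filter",
                "delimiter": "|",
                "encoding": "float",
            }
        },
        "analyzer": {
            "weighted_synonyms": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "weight_payloads"],
            }
        },
    }
}

# Hypothetical synonym table kept on the client side.
SYNONYMS = {"nurse": [("doctor", 0.4)]}


def with_weighted_synonyms(text, synonyms):
    """Append 'synonym|weight' tokens after each word that has synonyms."""
    out = []
    for token in text.split():
        out.append(token)
        for synonym, weight in synonyms.get(token.lower(), []):
            out.append(f"{synonym}|{weight}")
    return " ".join(out)


field_text = with_weighted_synonyms("ask the nurse", SYNONYMS)
print(json.dumps(ANALYSIS_SETTINGS, indent=2))
print(field_text)
```

Note that the appended synonyms occupy their own token positions here, so phrase queries would behave differently than with a real synonym filter, and something on the scoring side still has to read the payloads.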
The new function scoring [2] does expose payloads, but I have never used
it. Re-implementing TF-IDF inside a function score can get tricky. IMHO, it
might be easier to implement your own Similarity and have it do the payload
scoring. But once again, I have never used payloads in Elasticsearch, only
directly in Lucene, so I could be wrong.
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
Hopefully you can find a way to make things work with less code. It would
be great if payloads were more of a first-class citizen in Elasticsearch,
but it is up to the Lucene layer to handle analysis. I really need to play
around with the "new" text scoring abilities.