How to modify term frequency formula?


(geantbrun) #1

Hi,
If I understand correctly, the default similarity module computes the term
frequency part of the score as the square root of the actual frequency. Is it
possible to change that formula to something like
min(my_max_value, sqrt(frequency))? I would like to avoid huge tf values for
documents that repeat the same term many times. It seems that the BM25
similarity has a parameter to control saturation, but I would prefer to
stick with the simple tf/idf similarity module.
Thank you for your help,
Patrick
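
For reference, the saturation behaviours mentioned in this question can be compared numerically. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, the cap of 10 is an arbitrary example value, and the BM25 factor uses the textbook form with length normalization switched off (b = 0) and k1 = 1.2 assumed.

```java
// Hypothetical helper for illustration; not part of Lucene or Elasticsearch.
final class TfSaturationSketch {

    // Classic tf/idf term-frequency factor: square root of the raw count.
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // The variant asked about: sqrt(freq), capped at maxValue.
    static double cappedSqrtTf(int freq, double maxValue) {
        return Math.min(maxValue, Math.sqrt(freq));
    }

    // Textbook BM25 tf factor with b = 0 (no length normalization):
    // freq * (k1 + 1) / (freq + k1). It approaches k1 + 1 as freq grows.
    static double bm25Tf(int freq, double k1) {
        return freq * (k1 + 1.0) / (freq + k1);
    }

    public static void main(String[] args) {
        // Print the three factors side by side for a few raw frequencies.
        for (int freq : new int[] {1, 4, 100, 10_000}) {
            System.out.printf("freq=%d sqrt=%.2f capped=%.2f bm25=%.2f%n",
                    freq, sqrtTf(freq), cappedSqrtTf(freq, 10.0), bm25Tf(freq, 1.2));
        }
    }
}
```

The capped variant grows like the square root and then stops abruptly at the cap, whereas the BM25 factor flattens out gradually.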

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9a12b611-d08d-41f9-8fd4-b74ad75a6a5c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

You can provide your own similarity to be used at the field level, but
recent versions of Elasticsearch also let you access the tf-idf values in
order to do custom scoring [1]. Also look at Britta's recent talk on the
subject [2].

That said, either your custom similarity or your custom scoring would need
to know exactly which terms are repeated many times. Have you looked into
omitting term frequencies? That would bypass term frequencies completely,
which might be overkill in your case. Look into the index options [3].

Finally, perhaps the common terms query can help [4].

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html

[2] https://speakerdeck.com/elasticsearch/scoring-for-human-beings

[3] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#string

[4] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html

Cheers,

Ivan


(geantbrun) #3

Thanks a lot Ivan, great answer.

Suppose I use my own formula for tf in my script (with
_index[field][term].tf()) and set the boost_mode to "replace". Does
Elasticsearch calculate the tf twice or only once? In other words, is
it computationally efficient to calculate my own tf? Should I turn off other
calculations made by ES somewhere else to avoid computing things twice?

Cheers,
Patrick


(Ivan Brusic) #4

Term frequencies are stored within Lucene, so the value is not calculated at
query time; it is just a lookup in the data structure. You can disable term
frequencies and then create your own in the script, but it would be easier
to calculate that value at index time so that you can access it within your
custom score and not have to iterate through all the terms yourself. Britta
has posted on the mailing list in the past, so hopefully she will reply
with some more authoritative answers, especially regarding performance.
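
One possible reading of "calculate that value at index time" is to compute the capped value in the indexing client and store it alongside the document. The sketch below assumes that reading; the class name, method name, and whitespace-style token list are hypothetical, and how the result would be stored in Elasticsearch is not covered in this thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper; not an Elasticsearch or Lucene API.
final class IndexTimeTfSketch {

    // Counts each token and returns min(maxValue, sqrt(count)) per term.
    static Map<String, Double> cappedTfByTerm(List<String> tokens, double maxValue) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Double> cappedTf = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            cappedTf.put(entry.getKey(), Math.min(maxValue, Math.sqrt(entry.getValue())));
        }
        return cappedTf;
    }

    public static void main(String[] args) {
        List<String> tokens = java.util.Arrays.asList("spam", "spam", "spam", "spam", "ham");
        System.out.println(cappedTfByTerm(tokens, 1.5));
    }
}
```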

--
Ivan


(geantbrun) #5

Thanks again for the answer, Ivan. Would it be simpler to change the way tf
is calculated directly in the source code? I mean replacing something like
tf = sqrt(n) with tf = min(10, sqrt(n)) somewhere.
Cheers,
Patrick


(Ivan Brusic) #6

Did you see Britta's slides? She has a slide called "Cosine similarity as
script" which mimics the Lucene scoring as a script. You can replace the
call to _index[field][word].tf() with your own implementation. You can
deploy the script as a native Java script (note: not Javascript) for
performance.

I find it easier to just change the Similarity. Simply extend
DefaultSimilarity, override "public float tf(float freq)", and then
reference this similarity in your field mapping.
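
The override described above might look roughly like the following. To keep the sketch self-contained and runnable without Lucene on the classpath, SqrtTfSimilarity is a hypothetical stand-in for Lucene's DefaultSimilarity that models only the tf(float freq) method; in a real plugin the subclass would extend DefaultSimilarity instead. The class names and the cap of 10 are assumptions for illustration.

```java
// Hypothetical stand-in for Lucene's DefaultSimilarity; only tf() is modelled.
class SqrtTfSimilarity {
    // Default behaviour described in this thread: square root of the frequency.
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
}

// Subclass that caps the tf factor. With Lucene available, this would
// extend DefaultSimilarity rather than the stand-in above.
class CappedTfSimilarity extends SqrtTfSimilarity {
    private final float maxTf;

    CappedTfSimilarity(float maxTf) {
        this.maxTf = maxTf;
    }

    @Override
    public float tf(float freq) {
        // min(maxTf, sqrt(freq)): very large frequencies stop raising the factor.
        return Math.min(maxTf, super.tf(freq));
    }
}

final class CappedTfDemo {
    public static void main(String[] args) {
        SqrtTfSimilarity similarity = new CappedTfSimilarity(10f);
        System.out.println(similarity.tf(9f));      // below the cap
        System.out.println(similarity.tf(40_000f)); // above the cap
    }
}
```

Registering the class with Elasticsearch and referencing it from the field mapping are not shown, since those steps depend on the Elasticsearch version in use.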

--
Ivan


(geantbrun) #7

Yes, I saw Britta's slides, but I find it difficult to implement my own
scoring for complex queries (e.g. with AND and OR).
Do you have a concrete example or a link to share that explains the
override alternative in more detail?
Thanks again Ivan,
Patrick


(Ivan Brusic) #8

I am still on a version of Elasticsearch that does not have access to the
new scoring capabilities, so I cannot test any scripts. The non-normalized
term frequency should be the line:
tf = _index[field][word].tf()

If that is the case, you could substitute that line with something like:
tf = Math.min(10, _index[field][word].tf())
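
Note that capping the raw frequency is not the same as capping its square root, which is what the original question asked for. Whether a square root is applied afterwards depends on the rest of the script, which is not shown in this thread. A small sketch of the arithmetic, with hypothetical class and method names and the cap of 10 taken from the line above:

```java
// Hypothetical helper for illustration; not part of Lucene or Elasticsearch.
final class CapPlacementSketch {

    // Cap applied to the raw frequency first, then the square root.
    static double sqrtOfCappedFreq(int freq, int cap) {
        return Math.sqrt(Math.min(cap, freq));
    }

    // Cap applied after the square root, as in min(10, sqrt(n)).
    static double cappedSqrt(int freq, double cap) {
        return Math.min(cap, Math.sqrt(freq));
    }

    public static void main(String[] args) {
        int freq = 400;
        // The first form stops growing once freq reaches the cap itself,
        // the second once freq reaches the square of the cap.
        System.out.println(sqrtOfCappedFreq(freq, 10));
        System.out.println(cappedSqrt(freq, 10.0));
    }
}
```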

As I stated before, I am used to working with Similarities, so I find that
approach easier. Here is a custom similarity that I used in Elasticsearch (it
removes any norms that are indexed):

The second part would be the tf() method, which you would need to implement
instead of the decodeNormValue method I used.

Cheers,

Ivan

On Tue, Mar 25, 2014 at 1:42 PM, geantbrun agin.patrick@gmail.com wrote:

Yes I saw Britta's slides but I find it difficult to implement my own
scoring for complex queries (ex: with AND and OR).
Do you have a concrete example or a link to share to explain with more
details the override alternative?
Thanks again Ivan,
Patrick

Le mardi 25 mars 2014 12:04:26 UTC-4, Ivan Brusic a écrit :

Did you see Britta's slides? She has a slide called "Cosine similarity as
script" which mimics the Lucene scoring as a script. You can replace the
call to _index[field][word].tf() with your own implementation. You can
deploy the script as a native Java script (note: not Javascript) for
performance.

I find it easier to understand to just change the Similarity. Simply over
DefaultSimilarity and override "public float tf(float freq)" and then
reference this similarity in your field mapping.

--
Ivan

On Tue, Mar 25, 2014 at 6:57 AM, geantbrun agin.p...@gmail.com wrote:

Thanks again for the answer Ivan. Would it be simpler to modify directly
in the source code the way tf is calculated? I mean replacing somewhere
something like tf = sqrt(n) by tf = min(10,sqrt(n)).
Cheers,
Patrick

Le vendredi 21 mars 2014 18:01:51 UTC-4, Ivan Brusic a écrit :

Term frequencies are stored within Lucene, so there is no calculating
of the value, just a lookup in the data structure. You can disable term
frequencies and then create your own in the script, but it would be easier
to calculate that value at index time so that you can access it within your
custom score and not have to iterate through all the terms yourself. Britta
has posted on the mailing list in the past, so hopefully she will reply
with some more authoritative answers, especially ones regarding performance.

--
Ivan

On Fri, Mar 21, 2014 at 11:54 AM, geantbrun agin.p...@gmail.comwrote:

Thanks a lot Ivan, great answer.

Suppose I use in my script my own formula for tf (with
_index[field][term].tf()) and set the boost_mode to "replace", does
elasticsearch calculate the tf two times or once only? In other words, is
it computionnally efficient to calculate my own tf? Should I turn off other
calculations made by es somewhere else to avoid double calculations?

Cheers,
Patrick



(geantbrun) #9

Britta is looping over words that are passed as parameters. It's easy to
implement her script for a simple query, but what about boolean queries? As
I understand it (but I could be wrong, of course), I would have to parse
the query and call the script with each sub-clause. Is that right?

I prefer your custom similarity alternative. Again, sorry for the silly
question (newbie!), but where do you put your Java file? Is it the only
thing that is needed (apart from the change to the mapping)?
Cheers,
Patrick



(Ivan Brusic) #10

I updated my gist to illustrate the SimilarityProvider that goes along with
it. Similarities are easier to add to Elasticsearch than most plugins. You
just need to compile the two files into a jar and then add that jar into
Elasticsearch's classpath ($ES_HOME/lib most likely). The code will scan
for every SimilarityProvider defined and load it.

You then map the similarity to a field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#_configuring_similarity_per_field

Note that you cannot change the similarity of a field dynamically.

Ivan



(geantbrun) #11

Hi Ivan,
I followed your instructions but it does not seem to work; I must have made
a mistake somewhere. I created the jar file from the following two Java
files. Could you tell me if they are OK?

tfCappedSimilarity.java

    package org.elasticsearch.index.similarity;

    import org.apache.lucene.search.similarities.DefaultSimilarity;
    import org.elasticsearch.common.logging.ESLogger;
    import org.elasticsearch.common.logging.Loggers;

    public class tfCappedSimilarity extends DefaultSimilarity {

        private ESLogger logger;

        public tfCappedSimilarity() {
            logger = Loggers.getLogger(getClass());
        }

        /**
         * Capped tf value: frequencies above 9 are clamped before the
         * square root is taken, so tf never exceeds 3.
         */
        @Override
        public float tf(float freq) {
            return (float) Math.sqrt(Math.min(9, freq));
        }

    }

tfCappedSimilarityProvider.java

    package org.elasticsearch.index.similarity;

    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.common.inject.assistedinject.Assisted;
    import org.elasticsearch.common.settings.Settings;

    public class tfCappedSimilarityProvider extends AbstractSimilarityProvider {

        private tfCappedSimilarity similarity;

        @Inject
        public tfCappedSimilarityProvider(@Assisted String name, @Assisted Settings settings) {
            super(name);
            this.similarity = new tfCappedSimilarity();
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public tfCappedSimilarity get() {
            return similarity;
        }

    }

In my mapping, I define the similarity property of my field as
tfCappedSimilarity. Is that OK?

Why I think it does not work: I inserted a doc with a word
repeated 16 times in my field. When I search for that word, the
result shows a tf of 4 (the square root of 16) and not 3 as I was expecting.
Is there a way to know whether the similarity was loaded (maybe in a log
file)?

Cheers,
Patrick
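
The expected value in the post above can be checked outside Elasticsearch. The following is a minimal standalone sketch, not the Lucene implementation: the class name CappedTfDemo and both helper methods are hypothetical, and only the two formulas (the square root of the frequency, and the capped variant from the posted tf() override) come from the thread.

```java
public class CappedTfDemo {

    // Default tf as described in the thread: square root of the raw frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Capped variant from the posted override: clamp the frequency, then take the root.
    // maxFreq is a hypothetical parameter; the posted code hard-codes 9.
    static float cappedTf(float freq, float maxFreq) {
        return (float) Math.sqrt(Math.min(maxFreq, freq));
    }

    public static void main(String[] args) {
        float freq = 16f; // the word repeated 16 times, as in the test document
        System.out.println(defaultTf(freq));    // 4.0
        System.out.println(cappedTf(freq, 9f)); // 3.0
    }
}
```

A tf of 4 in the search result therefore matches the uncapped formula, which suggests that the default similarity, and not the custom one, was applied to the field.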



(geantbrun) #12

I realize that I probably have to define the similarity property of my
field as "my_similarity" (and not as "tfCappedSimilarity") and define in
the settings my_similarity as being of type tfCappedSimilarity.
Doing so produces the following error at index/mapping creation:

{"error":"IndexCreationException[[exbd] failed to create index]; nested:
NoClassSettingsException[Failed to load class setting [type] with value
[tfCappedSimilarity]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider];
","status":500}]

Note that the provider is referred to in the error as
tfCappedSimilaritySimilarityProvider ("Similarity" appears twice). Is that normal?
Patrick
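
The doubled "Similarity" in the reported class name is consistent with a fixed suffix being appended to the configured type value. The sketch below reproduces the exact string from the error message under that assumption. The prefix, the suffix, and the lookup pattern are inferred only from the error text, not taken from the Elasticsearch source, and ProviderNameSketch and candidateClassName are hypothetical names.

```java
import java.util.Locale;

public class ProviderNameSketch {

    // Both constants are assumptions read off the error message in the post above.
    static final String PREFIX = "org.elasticsearch.index.similarity.";
    static final String SUFFIX = "SimilarityProvider";

    // Builds the class name the error reports for a given "type" setting:
    // prefix + lower-cased type as a sub-package + type + suffix.
    static String candidateClassName(String type) {
        return PREFIX + type.toLowerCase(Locale.ROOT) + "." + type + SUFFIX;
    }

    public static void main(String[] args) {
        // Reproduces the name in the ClassNotFoundException from the post.
        System.out.println(candidateClassName("tfCappedSimilarity"));
    }
}
```

Under this reading, the name being looked up differs from org.elasticsearch.index.similarity.tfCappedSimilarityProvider, the class defined in the posted file, which would be consistent with the ClassNotFoundException.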



(Ivan Brusic) #13

Can you also post your mapping where you defined the similarity?

--
Ivan



(geantbrun) #14

Sure.

    {
      "settings" : {
        "index" : {
          "similarity" : {
            "my_similarity" : {
              "type" : "tfCappedSimilarity"
            }
          }
        }
      },
      "mappings" : {
        "post" : {
          "properties" : {
            "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
            "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
            "contents" : { "type" : "string", "store" : "no", "index" : "analyzed",
                           "similarity" : "my_similarity" }
          }
        }
      }
    }

If I replace tfCappedSimilarity with tfCapped in the mapping, the error
is the same except that the provider is referred to as
tfCappedSimilarityProvider and not as tfCappedSimilaritySimilarityProvider.
Cheers,
Patrick

Le lundi 31 mars 2014 17:13:24 UTC-4, Ivan Brusic a écrit :

Can you also post your mapping where you defined the similarity?

--
Ivan

On Mon, Mar 31, 2014 at 10:36 AM, geantbrun <agin.p...@gmail.com<javascript:>

wrote:

I realize that I probably have to define the similarity property of my
field as "my_similarity" (and not as "tfCappedSimilarity") and define in
the settings my_similarity as being of type tfCappedSimilarity.
When I do that, I get the following error at the index/mapping creation:

{"error":"IndexCreationException[[exbd] failed to create index]; nested:
NoClassSettingsException[Failed to load class setting [type] with value
[tfCappedSimilarity]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider];
","status":500}]

Note that the provider is referred in the error as tfCappedSimilaritySimilarityProvider
(similarity repeated 2 times). Is it normal?
Patrick

Le lundi 31 mars 2014 13:06:00 UTC-4, geantbrun a écrit :

Hi Ivan,
I followed your instructions but it does not seem to work, I must be
wrong somewhere. I created the jar file from the following two java files,
could you tell me if they are ok?

tfCappedSimilarity.java


package org.elasticsearch.index.similarity;

import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;

public class tfCappedSimilarity extends DefaultSimilarity {

    private ESLogger logger;

    public tfCappedSimilarity() {
            logger = Loggers.getLogger(getClass());
    }

    /**
     * Capped tf value
     */
    @Override
    public float tf(float freq) {
            return (float)Math.sqrt(Math.min(9, freq));
    }

}

tfCappedSimilarityProvider.java


package org.elasticsearch.index.similarity;

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;

public class tfCappedSimilarityProvider extends
AbstractSimilarityProvider {

    private tfCappedSimilarity similarity;

    @Inject
    public tfCappedSimilarityProvider(@Assisted String name, 

@Assisted Settings settings) {
super(name);
this.similarity = new tfCappedSimilarity();
}

    /**
     * {@inheritDoc}
     */
    @Override
    public tfCappedSimilarity get() {
            return similarity;
    }

}

In my mapping, I define the similarity property of my field as
tfCappedSimilarity, is it ok?

What makes me say that it does not work: I insert a doc with a word
repeated 16 times in my field. When I do a search with that word, the
result shows a tf of 4 (square root of 16) and not 3 as I was expecting, Is
there a way to know if the similarity was loaded or not (maybe in a log
file?).

Cheers,
Patrick

Le mercredi 26 mars 2014 17:16:36 UTC-4, Ivan Brusic a écrit :

I updated my gist to illustrate the SimilarityProvider that goes along
with it. Similarities are easier to add to Elasticsearch than most plugins.
You just need to compile the two files into a jar and then add that jar
into Elasticsearch's classpath ($ES_HOME/lib most likely). The code will
scan for every SimilarityProvider defined and load it.

You then map the similarity to a field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#_configuring_similarity_per_field

Note that you cannot change the similarity of a field dynamically.

Ivan

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#_configuring_similarity_per_field

On Wed, Mar 26, 2014 at 12:49 PM, geantbrun agin.p...@gmail.com wrote:

Britta is looping over words that are passed as parameters. It's easy
to implement her script for a simple query, but what about boolean queries?
As I understand it (but I could be wrong, of course), I would have to parse
the query and call the script with each sub-clause. Is that right?

I prefer your custom similarity alternative. Again, sorry for the
silly question (newbie!), but where do you put your Java file? Is it the
only thing that is needed (apart from the modification in the mapping)?
Cheers,
Patrick

On Wednesday, March 26, 2014 11:58:52 UTC-4, Ivan Brusic wrote:

I am still on a version of Elasticsearch that does not have access to
the new scoring capabilities, so I cannot test any scripts. The
non-normalized term frequency should be the line:
tf = _index[field][word].tf()

If that is the case, you could replace that line with something
like:
tf = Math.min(10, _index[field][word].tf())

As I stated before, I am used to using Similarities, so I find the
example easier. Here is a custom similarity that I used in Elasticsearch
(it removes any norms that are indexed):
https://gist.github.com/brusic/9786587

The second part would be the tf() method, which you would need to
implement instead of the decodeNormValue method I used.
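
The effect of the Math.min(10, ...) substitution can be sketched in plain Java. This is not the scripting API: a map from term to raw frequency stands in for _index[field][word].tf(), the class and method names are hypothetical, and the score is a bare sum of capped frequencies, whereas a real script would presumably also weight each term (for example by idf).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a scoring script; no Elasticsearch classes are used.
final class CappedTfScoreSketch {

    static final int MAX_TF = 10; // cap taken from the substituted line above

    /** Sums the capped raw term frequency of every query word found in the field. */
    static int score(Map<String, Integer> fieldTermFreqs, List<String> queryWords) {
        int total = 0;
        for (String word : queryWords) {
            int tf = fieldTermFreqs.getOrDefault(word, 0); // 0 when the word is absent
            total += Math.min(MAX_TF, tf);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("elasticsearch", 16);
        freqs.put("lucene", 2);
        // 16 is capped to 10, 2 is kept as is: prints 12
        System.out.println(score(freqs, Arrays.asList("elasticsearch", "lucene")));
    }
}
```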

Cheers,

Ivan



(Ivan Brusic) #15

It has been a while since I used a custom similarity, but what you have
looks right. Can you try a full class name instead?
Use org.elasticsearch.index.similarity.tfCappedSimilarityProvider.
According to the error, it is looking for
org.elasticsearch.index.similarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider.
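
The class name in the error can be reproduced from the short type name with simple string handling. The sketch below is reconstructed only from the error text quoted in this thread; it is not Elasticsearch's actual lookup code, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Reconstruction of the name seen in the error message, not the real lookup code.
final class ProviderNameSketch {

    static final String PREFIX = "org.elasticsearch.index.similarity.";
    static final String SUFFIX = "SimilarityProvider";

    /** Builds prefix + lowercased type + "." + type + suffix, as seen in the error. */
    static String nameFromError(String type) {
        return PREFIX + type.toLowerCase(Locale.ROOT) + "." + type + SUFFIX;
    }

    public static void main(String[] args) {
        // Prints the class name that the ClassNotFoundException reports
        System.out.println(nameFromError("tfCappedSimilarity"));
    }
}
```

Under this reading, a short type name ending in "Similarity" gets the suffix appended a second time, which would explain the doubled "SimilaritySimilarity" in the message.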

--
Ivan

On Tue, Apr 1, 2014 at 7:00 AM, geantbrun agin.patrick@gmail.com wrote:

Sure.

{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_similarity" : {
          "type" : "tfCappedSimilarity"
        }
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
        "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "my_similarity" }
      }
    }
  }
}

If I replace tfCappedSimilarity with tfCapped in the mapping, the error
is the same, except that the provider is referred to as
tfCappedSimilarityProvider and not as tfCappedSimilaritySimilarityProvider.
Cheers,
Patrick

On Monday, March 31, 2014 17:13:24 UTC-4, Ivan Brusic wrote:

Can you also post your mapping where you defined the similarity?

--
Ivan

On Mon, Mar 31, 2014 at 10:36 AM, geantbrun agin.p...@gmail.com wrote:

I realize that I probably have to set the similarity property of my
field to "my_similarity" (and not to "tfCappedSimilarity") and define
my_similarity in the settings as being of type tfCappedSimilarity.
When I do that, I get the following error when creating the index/mapping:

{"error":"IndexCreationException[[exbd] failed to create index]; nested:
NoClassSettingsException[Failed to load class setting [type] with value
[tfCappedSimilarity]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider];
","status":500}]

Note that the error refers to the provider as
tfCappedSimilaritySimilarityProvider ("Similarity" appears twice). Is
that normal?
Patrick



(geantbrun) #16

To better understand the error, I copied your NormRemovalSimilarity and
NormRemovalSimilarityProvider code snippets into
usr/share/elasticsearch/lib and put these 2 files in a jar named
NormRemovalSimilarity.jar. After restarting the elasticsearch service, I
tried to create the index with the same mapping as before (except that I
put "type" : "NormRemoval" in the settings of my_similarity).

The result is the same:
{"error":"IndexCreationException[[exbd] failed to create index]; nested:
NoClassSettingsException[Failed to load class setting [type] with value
[NormRemoval]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.normremoval.NormRemovalSimilarityProvider];
","status":500}]

I deleted the jar file just to see whether the error would be the same:
yes, it is. It's as if the new similarity is never found or loaded. Is it
still working without modifications on your side?
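
One way to test the "never found or loaded" hypothesis outside Elasticsearch is to ask the JVM directly whether a class name resolves. The following is a minimal sketch, assuming it is run with the same classpath as the node (for example, the jars in the lib directory mentioned above); the class name ClasspathCheckSketch is hypothetical.

```java
// Hypothetical diagnostic; it only reports what the current classpath can resolve.
final class ClasspathCheckSketch {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.elasticsearch.index.similarity.NormRemovalSimilarityProvider";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If this prints false with the jar in place, the problem is in the jar or its location rather than in the index settings.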
Cheers,
Patrick



(Ivan Brusic) #17

Are you using a full class name? I have no problems with

curl -XPOST 'http://localhost:9200/sim/' -d '
{
  "settings" : {
    "similarity" : {
      "my_similarity" : {
        "type" : "org.elasticsearch.index.similarity.NormRemovalSimilarityProvider"
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
        "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "my_similarity" }
      }
    }
  }
}
'



(geantbrun) #18

Thanks again for your great help, Ivan. It does not work for me. When I
replace NormRemovalSimilarityProvider with BM25SimilarityProvider (or
simply BM25), it works. Is it possible that I put my jar file in the
wrong directory (usr/share/elasticsearch/lib)? Is it necessary to
register the new classes somewhere before restarting the service?
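
Besides the directory, the contents of the jar may be worth checking. Nothing in the thread confirms this is the cause, but a jar that holds .java sources, or compiled classes outside the org/elasticsearch/index/similarity/ path, would also produce a ClassNotFoundException. A minimal sketch using the standard java.util.jar API follows; the class name JarEntrySketch and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical diagnostic: checks that a jar has the compiled class at the expected path.
final class JarEntrySketch {

    /** Converts a fully qualified class name to the entry name it must have in a jar. */
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the jar at jarPath contains the compiled class. */
    static boolean jarContains(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryNameFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarEntrySketch <jar path> <class name>");
            return;
        }
        System.out.println(entryNameFor(args[1]) + " present: " + jarContains(args[0], args[1]));
    }
}
```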
Cheers,
Patrick

Le mercredi 2 avril 2014 17:47:46 UTC-4, Ivan Brusic a écrit :

Are you using a full class name? I have no problems with

curl -XPOST 'http://localhost:9200/sim/' -d '
{
"settings" : {
"similarity" : {
"my_similarity" : {
"type" :
"org.elasticsearch.index.similarity.NormRemovalSimilarityProvider"
}
}
},
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed"},
"contents" : { "type" : "string", "store" : "no", "index" :
"analyzed", "similarity" : "my_similarity"}
}
}
}
}
'

On Wed, Apr 2, 2014 at 12:03 PM, geantbrun <agin.p...@gmail.com<javascript:>

wrote:

In order to better understand the error, I copied your
NormRemovalSimilarity and NormRemovalSimilarityProvider code snippets in
usr/share/elasticsearch/lib. I put these 2 files in a jar named
NormRemovalSimilarity.jar. After restarting the elasticsearch service, I
tried to create the index with the same mapping as before (except that I
put "type" : "NormRemoval" in the settings of my_similarity.

The result is the same:
{"error":"IndexCreationException[[exbd] failed to create index]; nested:
NoClassSettingsException[Failed to load class setting [type] with value
[NormRemoval]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.normremoval.NormRemovalSimilarityProvider];
","status":500}]

I deleted the jar file just to see if the error is the same: yes it is.
It's like the new similarity is never found or loaded. Is it still working
without modifications on your side?
Cheers,
Patrick

Le mercredi 2 avril 2014 00:31:44 UTC-4, Ivan Brusic a écrit :

It has been a while since I used a custom similarity, but what you have
looks right. Can you try a full class name instead?
Use org.elasticsearch.index.similarity.tfCappedSimilarityProvider.
According to the error, it is looking for org.elasticsearch.index.si
milarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider.

--
Ivan

On Tue, Apr 1, 2014 at 7:00 AM, geantbrun agin.p...@gmail.com wrote:

Sure.

{
"settings" : {
"index" : {
"similarity" : {
"my_similarity" : {
"type" : "tfCappedSimilarity"
}
}
}
},
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" :
"analyzed"},
"contents" : { "type" : "string", "store" : "no", "index" :
"analyzed", "similarity" : "my_similarity"}
}
}
}
}

If I substitute tfCappedSimilarity for tfCapped in the mapping, the
error is the same except that provider is referred as
tfCappedSimilarityProvider and not as tfCappedSimilaritySimilarit
yProvider.
Cheers,
Patrick

Le lundi 31 mars 2014 17:13:24 UTC-4, Ivan Brusic a écrit :

Can you also post your mapping where you defined the similarity?

--
Ivan

On Mon, Mar 31, 2014 at 10:36 AM, geantbrun agin.p...@gmail.comwrote:

I realize that I probably have to define the similarity property of
my field as "my_similarity" (and not as "tfCappedSimilarity") and define in
the settings my_similarity as being of type tfCappedSimilarity.
When I do that, I get the following error at the index/mapping
creation:

{"error":"IndexCreationException[[exbd] failed to create index];
nested: NoClassSettingsException[Failed to load class setting [type]
with value [tfCappedSimilarity]]; nested:
ClassNotFoundException[org.elasticsearch.index.similarity.tfcappedsimilarity.tfCappedSimilaritySimilarityProvider];
","status":500}]

Note that the provider is referred to in the error as
tfCappedSimilaritySimilarityProvider ("Similarity" repeated twice). Is
that normal?
Patrick

On Monday, March 31, 2014 at 13:06:00 UTC-4, geantbrun wrote:

Hi Ivan,
I followed your instructions but it does not seem to work; I must have
made a mistake somewhere. I created the jar file from the following two
Java files. Could you tell me if they are OK?

tfCappedSimilarity.java


package org.elasticsearch.index.similarity;

import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;

public class tfCappedSimilarity extends DefaultSimilarity {

    private ESLogger logger;

    public tfCappedSimilarity() {
        logger = Loggers.getLogger(getClass());
    }

    /**
     * Capped tf value
     */
    @Override
    public float tf(float freq) {
        return (float) Math.sqrt(Math.min(9, freq));
    }

}

tfCappedSimilarityProvider.java


package org.elasticsearch.index.similarity;

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;

public class tfCappedSimilarityProvider extends AbstractSimilarityProvider {

    private tfCappedSimilarity similarity;

    @Inject
    public tfCappedSimilarityProvider(@Assisted String name, @Assisted Settings settings) {
        super(name);
        this.similarity = new tfCappedSimilarity();
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public tfCappedSimilarity get() {
        return similarity;
    }

}

In my mapping, I define the similarity property of my field as
tfCappedSimilarity. Is that OK?

What makes me say that it does not work: I insert a doc with a word
repeated 16 times in my field. When I do a search with that word, the
result shows a tf of 4 (square root of 16) and not 3 as I was expecting.
Is there a way to know if the similarity was loaded or not (maybe in a log
file)?
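
The expected numbers can be checked without Lucene or Elasticsearch. The following is a minimal sketch that copies only the arithmetic: the default square-root tf and the capped variant from the tf() override above. The class name CappedTfDemo and the constant MAX_FREQ are made up for illustration; the cap of 9 is the one used in the posted code.

```java
// Standalone sketch: no Lucene dependency, only the tf arithmetic.
final class CappedTfDemo {

    // Cap taken from the Math.min(9, freq) call in the posted similarity.
    static final float MAX_FREQ = 9f;

    // Default behaviour described in the thread: square root of the raw frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Capped variant: clamp the frequency first, then take the square root.
    static float cappedTf(float freq) {
        return (float) Math.sqrt(Math.min(MAX_FREQ, freq));
    }

    public static void main(String[] args) {
        float freq = 16f; // the word is repeated 16 times in the field
        System.out.println("default tf = " + defaultTf(freq)); // prints 4.0
        System.out.println("capped tf  = " + cappedTf(freq));  // prints 3.0
    }
}
```

A tf of 4 in the search result is therefore what the default similarity would produce, which points to the custom class not being picked up rather than to a mistake in the formula.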

Cheers,
Patrick

On Wednesday, March 26, 2014 at 17:16:36 UTC-4, Ivan Brusic wrote:

I updated my gist to illustrate the SimilarityProvider that goes
along with it. Similarities are easier to add to Elasticsearch than most
plugins. You just need to compile the two files into a jar and then add
that jar into Elasticsearch's classpath ($ES_HOME/lib most likely). The
code will scan for every SimilarityProvider defined and load it.

You then map the similarity to a field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#_configuring_similarity_per_field

Note that you cannot change the similarity of a field dynamically.

Ivan

On Wed, Mar 26, 2014 at 12:49 PM, geantbrun agin.p...@gmail.com wrote:

Britta is looping over words that are passed as parameters. It's
easy to implement her script for a simple query, but what about boolean
queries? As I understand it (but I could be wrong of course), I would have
to parse the query to call the script with each sub-clause. Is that right?

I prefer your custom similarity alternative. Again, sorry for the
silly question (newbie!) but where do you put your java file? Is it the
only thing that is needed (except for the modification in the mapping)?
cheers,
Patrick

On Wednesday, March 26, 2014 at 11:58:52 UTC-4, Ivan Brusic wrote:

I am still on a version of Elasticsearch that does not have
access to the new scoring capabilities, so I cannot test out any scripts.
The non-normalized term frequency should be the line:
tf = _index[field][word].tf()

If that is the case, you could substitute that line with
something like:
tf = Math.min(10, _index[field][word].tf())

As I stated before, I am used to using Similarities, so I find
the example easier. Here is a custom similarity that I used in
Elasticsearch (removes any norms that are indexed):
https://gist.github.com/brusic/9786587

The second part would be the tf() method you would need to
implement instead of decodeNormValue I used.

Cheers,

Ivan



(geantbrun) #19

Ivan,
Sorry, but I realize (I am not at all familiar with Java) that I skipped the Java
compile step (I simply put the .java files in a jar file with jar cf). The
problem now is that executing:

javac NormRemovalSimilarity.java -classpath ./elasticsearch-1.1.0.jar

generates errors, the first one being:

package org.apache.lucene.search.similarities does not exist

Googled it but found nothing. Any idea?
Patrick

P.S. I installed elasticsearch following the easy way https://gist.github.com/wingdspur/2026107 (dpkg the deb file)

On Thursday, April 3, 2014 at 09:16:02 UTC-4, geantbrun wrote:

Thanks again for your great help Ivan. It does not work for me. When I
replace NormRemovalSimilarityProvider with BM25SimilarityProvider (or
simply with BM25), it works. Is it possible that I put my jar file in the
wrong directory (usr/share/elasticsearch/lib)? Is it necessary to
register the new classes I define somewhere before restarting the service?
Cheers,
Patrick

On Wednesday, April 2, 2014 at 17:47:46 UTC-4, Ivan Brusic wrote:

Are you using a full class name? I have no problems with

curl -XPOST 'http://localhost:9200/sim/' -d '
{
  "settings" : {
    "similarity" : {
      "my_similarity" : {
        "type" : "org.elasticsearch.index.similarity.NormRemovalSimilarityProvider"
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
        "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "my_similarity" }
      }
    }
  }
}
'


(Ivan Brusic) #20

I added a simple Maven pom to the gist:

The easiest thing to do is download Maven (if you do not have it) and let it
handle the dependencies and build a jar; you simply execute:
mvn package

Since Elasticsearch already comes bundled with the correct jars, you can
also add those to your classpath instead. I think you only need Lucene
core, which is in $ES_HOME/lib/lucene-core-4-?-?.jar. Substitute the
correct version for the question marks. I am not on Elasticsearch, so I do
not know offhand which version of Lucene is packaged.
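
For the second route, the exact jar version does not have to be known in advance. Below is a rough sketch, not a tested build recipe, that scans a lib directory for the elasticsearch and lucene-core jars and prints a javac command with both on the classpath. The class name CompileClasspathDemo and the method buildClasspath are hypothetical, the default directory is the /usr/share/elasticsearch/lib location mentioned earlier in the thread, and it assumes those two jars are enough to compile, which is only a guess.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: finds the jars needed on the compile classpath.
final class CompileClasspathDemo {

    // Collects elasticsearch-*.jar and lucene-core-*.jar from the given directory
    // and joins them with the platform's path separator.
    static String buildClasspath(Path libDir) throws IOException {
        List<String> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(libDir, "{elasticsearch,lucene-core}-*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toString());
            }
        }
        Collections.sort(jars); // stable order, independent of the file system
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another directory as the first argument.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "/usr/share/elasticsearch/lib");
        System.out.println("javac -classpath " + buildClasspath(libDir)
                + " NormRemovalSimilarity.java NormRemovalSimilarityProvider.java");
    }
}
```

If javac still reports missing packages with this classpath, the Maven route above is probably the simpler option, since it resolves the dependencies itself.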

--
Ivan
