How to change similarity without actual code

shlomivaknin · February 20, 2013, 4:31pm

Hey,

I am new to ES, and I have a task where I don't know which approach is best
to take.

Our data is a simple line of text plus some numeric fields, and our queries
run only against the line of text.
When I query a few terms, (as far as I understand) the score is calculated
in a way that prefers multiple occurrences of the terms in the text, and
also prefers longer matches.

If I wanted to change that (say, ignore how many times a term appears, and
ignore the length), I would write this in Lucene:

public class MySimilarity extends DefaultSimilarity {

    // We don't care about how many times a term appears in the text
    @Override
    public float tf(float freq) {
        return freq == 0 ? 0 : 1;
    }

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return state.getBoost();    // ignore length factor
    }
}

Now my question is: is there a way to do this kind of thing in ES without
actually writing code, i.e. by using the DSL?

Thanks!
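
For readers following along, here is a minimal plain-Java sketch of the two factors the class above overrides. It assumes the classic Lucene defaults (tf = sqrt(freq), norm = boost / sqrt(number of terms)) and leaves out the lossy one-byte encoding Lucene applies to norms. The class and method names are hypothetical and there is no Lucene dependency; it only illustrates the arithmetic.

```java
// Illustration only: names are hypothetical, no Lucene classes are used.
class ScoreFactorsSketch {

    // Classic default: tf grows with the square root of the term frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flattened tf, as in MySimilarity: any match counts exactly once.
    static float flatTf(float freq) {
        return freq == 0 ? 0f : 1f;
    }

    // Classic default length norm: boost / sqrt(number of terms in the field).
    // (The one-byte norm encoding is deliberately omitted here.)
    static float defaultNorm(int numTerms, float boost) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // Flattened norm, as in MySimilarity: only the index-time boost survives.
    static float flatNorm(int numTerms, float boost) {
        return boost;
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a 16-term field, boost 1.0.
        System.out.println(defaultTf(4f) * defaultNorm(16, 1f)); // prints 0.5
        System.out.println(flatTf(4f) * flatNorm(16, 1f));       // prints 1.0
    }
}
```

With the flattened versions, the product no longer depends on the term count or the field length, which is the behaviour asked for above.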

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

simonw_2 · February 20, 2013, 10:41pm

Hey,

For the term frequency part I would just recommend omitting the TF and
indexing document IDs only. This will effectively do what you showed in your
example, just without the "branch" in the similarity (see 'index_options' in
the mapping documentation).
For the length normalization I'd omit norms ('omit_norms' in the mapping
documentation) in the mapping and use a custom score (see the custom score
query documentation).

This should be equivalent to what you want, and you can influence how much
weight the boost gets at runtime.

simon
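
A rough sketch of what such a mapping could look like, held in a Java string so it can be handed to whatever client is in use. The type name "line" and the field name "text" are hypothetical; 'index_options' and 'omit_norms' are the two settings named above, written in the string-field syntax of Elasticsearch releases from that period. The custom score part is not shown.

```java
// Sketch only: the type name "line" and the field name "text" are hypothetical.
class MappingSketch {

    // Mapping JSON for a string field that indexes document ids only
    // (no term frequencies, no positions) and stores no length norms.
    static final String MAPPING = String.join("\n",
        "{",
        "  \"line\": {",
        "    \"properties\": {",
        "      \"text\": {",
        "        \"type\": \"string\",",
        "        \"index_options\": \"docs\",",
        "        \"omit_norms\": true",
        "      }",
        "    }",
        "  }",
        "}");

    public static void main(String[] args) {
        // Print the JSON so it can be sent with the put-mapping API.
        System.out.println(MAPPING);
    }
}
```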

shlomivaknin · February 21, 2013, 10:02am

Hey

Thank you for your response,

"omit_norms" seemed to do the job, but setting "index_options" to "docs"
made searches other than direct term matches unavailable, meaning I couldn't
run a query_string like: +bre* -break* AND "Tons of"

I tried my luck with the unique token filter
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/unique-tokenfilter.html).
From what I understood (from the really short description), it should give
me similar results. Am I correct?

simonw_2 · February 22, 2013, 10:39pm

Ah, I see. Yes, setting this to "docs" will drop positions, so queries like
"Tons of" won't work anymore. UniqueTokenFilter should do the job here!

simon

shlomivaknin · February 24, 2013, 3:34pm

Thanks!

So I tried that, and it worked fine until I tried to query something like
"bye bye", which was not distinguishable from "bye" (as opposed to "bye
now", for instance).

Of course I could use a shingle token filter, but that would needlessly
enlarge my index.

Any other suggestions?
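
To make the "bye bye" case concrete, here is a small plain-Java sketch of token de-duplication. It is not the Elasticsearch filter itself: it assumes simple whitespace tokenization and lowercasing, and the class and method names are hypothetical. It only shows why a repeated word ends up with the same token stream as a single occurrence.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustration only: mimics the idea of dropping duplicate tokens,
// not the actual Elasticsearch unique token filter implementation.
class UniqueTokensSketch {

    // Split on whitespace, lowercase, and keep the first occurrence of each token.
    static List<String> uniqueTokens(String text) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token); // duplicates are silently ignored by the set
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(uniqueTokens("bye bye")); // prints [bye]
        System.out.println(uniqueTokens("bye"));     // prints [bye]
        System.out.println(uniqueTokens("bye now")); // prints [bye, now]
    }
}
```

Once duplicates are dropped, "bye bye" and "bye" produce identical tokens, so nothing is left to tell them apart, while "bye now" keeps both of its tokens.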
