Ignore term frequency (not relevant for the type of document I'm using)

Hi,

How do I get ES to ignore the term frequency, since it is not relevant for
the type of document I'm using?

I'm using two ES types to handle two kinds of data that need different
analyzers.
I'm trying to query the two types using a multi_match, but I would like to
ignore the term frequency.
I tried using "index_options" : "docs" on my fields, but I'm still getting
different scores depending on the term frequency.

Mapping:
curl -XPOST "localhost:9200/myindex" -d '
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "name_nGram": {
            "max_gram": 100,
            "min_gram": 2,
            "type": "edge_ngram"
          },
          "strip_hydrid_sign_filter": {
            "pattern": "\u00D7",
            "replacement": "",
            "type": "pattern_replace"
          }
        },
        "analyzer": {
          "name_index": {
            "filter": ["lowercase", "asciifolding", "name_nGram"],
            "tokenizer": "keyword"
          },
          "full_name_index": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          },
          "scientificname_index": {
            "filter": ["lowercase", "asciifolding", "strip_hydrid_sign_filter", "name_nGram"],
            "tokenizer": "keyword"
          },
          "name_search": {
            "filter": ["lowercase", "asciifolding"],
            "tokenizer": "keyword"
          },
          "scientificname_search": {
            "filter": ["lowercase", "asciifolding", "strip_hydrid_sign_filter"],
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "taxon": {
      "properties": {
        "taxonname": {
          "type": "multi_field",
          "fields": {
            "taxonname": {
              "type": "string",
              "index_analyzer": "full_name_index",
              "search_analyzer": "name_search",
              "omit_norms": true,
              "index_options": "docs"
            },
            "ngrams": {
              "type": "string",
              "index_analyzer": "scientificname_index",
              "search_analyzer": "scientificname_search",
              "omit_norms": true,
              "index_options": "docs"
            }
          }
        }
      }
    },
    "vernacular": {
      "properties": {
        "vernacularname": {
          "type": "multi_field",
          "fields": {
            "vernacularname": {
              "type": "string",
              "index_analyzer": "full_name_index",
              "search_analyzer": "name_search",
              "omit_norms": true,
              "index_options": "docs"
            },
            "ngrams": {
              "type": "string",
              "index_analyzer": "name_index",
              "search_analyzer": "name_search",
              "omit_norms": true,
              "index_options": "docs"
            }
          }
        }
      }
    }
  }
}'

Data:
curl -XPUT "localhost:9200/myindex/taxon/1" -d '{
"taxonname":"Carex capitata"
}'
curl -XPUT "localhost:9200/myindex/taxon/2" -d '{
"taxonname":"Carex heleonastes"
}'
curl -XPUT "localhost:9200/myindex/taxon/3" -d '{
"taxonname":"Carex buckleyi"
}'

curl -XPUT "localhost:9200/myindex/vernacular/1" -d '{
"vernacularname":"carex de Richardson"
}'
curl -XPUT "localhost:9200/myindex/vernacular/2" -d '{
"vernacularname":"carex du lac Tahoe"
}'

Query:
curl "localhost:9200/myindex/_search?search_type=dfs_query_then_fetch&pretty=1" -d '{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "carex",
            "fields": ["taxonname", "taxonname.ngrams"]
          }
        },
        {
          "multi_match": {
            "query": "carex",
            "fields": ["vernacularname", "vernacularname.ngrams"]
          }
        }
      ]
    }
  }
}'

This gives vernacularname a better score than taxonname, since the two
have different term frequencies.

So, how can I ignore the term frequency so that vernacularname and taxonname
get the same score? Or is there a better way to achieve that?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

Inverse document frequencies also play a role when scoring with the default
similarity. You can ignore the scoring of a query by wrapping it inside a
constant score query[1]. Does it help? Another option would be to write a
custom similarity extending the default one that would always return 1 for
the idf.

[1]
http://www.elasticsearch.org/guide/reference/query-dsl/constant-score-query/

--
Adrien Grand

Hi Adrien,
Indeed, "explain" returns idf(docFreq=1334, maxDocs=57595) for the type
taxon and idf(docFreq=366, maxDocs=57595) for the type vernacular, so I guess
the inverse document frequency is the main reason.
I'm not sure I understand the purpose of the constant score query.
I do want a score so that I can sort the results by relevance (ngrams
fields), but I don't need the idf, since the document frequency is not
relevant in this specific context.
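
To make the idf gap concrete, here is a minimal Python sketch. It assumes the default similarity is Lucene's classic TF-IDF, where idf = 1 + ln(maxDocs / (docFreq + 1)). The function name classic_idf is made up for illustration; the docFreq and maxDocs values are the ones reported by "explain" above.

```python
import math

def classic_idf(doc_freq, max_docs):
    """idf as in Lucene's classic TF-IDF similarity (assumed to be the default)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Values taken from the "explain" output quoted above.
idf_taxon = classic_idf(doc_freq=1334, max_docs=57595)
idf_vernacular = classic_idf(doc_freq=366, max_docs=57595)

print("taxon idf:      %.4f" % idf_taxon)
print("vernacular idf: %.4f" % idf_vernacular)
# The rarer term gets the larger idf, so vernacular hits score higher
# even though both fields match "carex" equally well.
print("vernacular > taxon:", idf_vernacular > idf_taxon)
```

With these numbers the vernacular idf comes out larger (roughly 6.06 against 4.76). Since idf enters the classic score squared, the gap in the final score is wider still.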

I guess the custom similarity should be something like this:
GitHub - tlrx/elasticsearch-custom-similarity-provider: A custom SimilarityProvider example for Elasticsearch

Thanks,

Christian

Hi,

On Fri, Jul 26, 2013 at 2:08 PM, Christian Gendreau <christiangendreau@gmail.com> wrote:

> I'm not sure I understand the purpose of the constant score query.
> Actually, I want to have the score to sort them by relevance (ngrams
> fields) but I don't need the idf since the document frequency is not
> relevant in this specific context.

I wanted to mention that if you run a boolean query with two clauses that
are term queries wrapped in constant score queries, TF-IDF won't be
involved in the scoring: the best documents will be those with the
highest number of matching clauses.

> I guess the custom similarity should be something like that :
> GitHub - tlrx/elasticsearch-custom-similarity-provider: A custom SimilarityProvider example for Elasticsearch

Exactly. You can even override tf(float freq) to something like "return
freq > 0 ? 1 : 0;" if you don't want to take the term frequency into
account either.
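
As a rough illustration of what such an override changes, here is a small Python model. It is not the Lucene Java API: the class names, the term_weight helper and the simplified formula (tf * idf^2, with norms, coord and query normalization left out) are assumptions made for this sketch. The docFreq and maxDocs values are the ones reported earlier in the thread.

```python
import math

class ClassicSimilarity:
    """Simplified model of the classic TF-IDF factors (assumed default)."""

    def tf(self, freq):
        return math.sqrt(freq)

    def idf(self, doc_freq, max_docs):
        return 1.0 + math.log(max_docs / (doc_freq + 1))


class FlatSimilarity(ClassicSimilarity):
    """Hypothetical override: a term either matches or it does not."""

    def tf(self, freq):
        return 1.0 if freq > 0 else 0.0

    def idf(self, doc_freq, max_docs):
        return 1.0


def term_weight(sim, freq, doc_freq, max_docs):
    # Simplified per-term score: tf * idf^2. Norms, coord and query
    # normalization are deliberately left out of this sketch.
    return sim.tf(freq) * sim.idf(doc_freq, max_docs) ** 2


classic, flat = ClassicSimilarity(), FlatSimilarity()
for label, doc_freq in (("taxon", 1334), ("vernacular", 366)):
    print(label,
          "classic=%.2f" % term_weight(classic, 1, doc_freq, 57595),
          "flat=%.2f" % term_weight(flat, 1, doc_freq, 57595))
```

In this model the flat variant gives both types the same weight for a matching term, while the classic variant favours the rarer vernacular term.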

--
Adrien Grand

Hi,

I tried that:
curl "localhost:9200/myindex/_search?search_type=dfs_query_then_fetch&pretty=1" -d '{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": {
                "taxonname": { "query": "carex" }
              }
            },
            "boost": 1
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "taxonname.ngrams": { "query": "carex" }
              }
            },
            "boost": 1
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "vernacularname": { "query": "carex" }
              }
            },
            "boost": 1
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "vernacularname.ngrams": { "query": "carex" }
              }
            },
            "boost": 1
          }
        }
      ]
    }
  },
  "size": 100,
  "sort": [
    "_score",
    { "sortname": { "order": "asc" } }
  ]
}'

I'm not sure if this is exactly what you meant, but it seems to work!
I'm also not sure whether this is the most efficient way to do it, or
whether the custom similarity would perform better.
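
Since the four clauses differ only in the field name, the request body can also be generated instead of written by hand. The Python sketch below is one possible way to do that; the helper name constant_score_should is made up, and the constant_score syntax simply mirrors the query above (it is not checked against any particular Elasticsearch version).

```python
import json

def constant_score_should(text, fields, boost=1):
    """Build a bool/should body with one constant_score match clause per field."""
    clauses = [
        {
            "constant_score": {
                "query": {"match": {field: {"query": text}}},
                "boost": boost,
            }
        }
        for field in fields
    ]
    return {"query": {"bool": {"should": clauses}}}

# Field names taken from the mapping earlier in the thread.
FIELDS = [
    "taxonname",
    "taxonname.ngrams",
    "vernacularname",
    "vernacularname.ngrams",
]

body = constant_score_should("carex", FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d. Each clause contributes the same constant, so a document matching both the exact field and its ngrams sub-field should rank above one matching the ngrams alone, whichever type it belongs to.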

Thanks for your help,

Christian
