How to efficiently use a stemmer to improve search results?

Hi there,

Please take a look at my gist: https://gist.github.com/damienalexandre/5749824
I use a "francais" analyzer that applies the following filters:

  • lowercase
  • asciifolding
  • elision
  • stemmer_fr

It uses the built-in minimal_french stemmer, so "Queens" becomes "queen".

My question is this:
if I have a "Queens of the stone age" doc and a "Queen" doc, then when I
search for "Queens" I want both documents, but "Queens of the stone age"
should score higher because it contains the exact word.

If you run my gist you will see that both searches return exactly the same
result.
That is perfectly normal, since my stored tokens are all "queen". But I'm looking
for a way to use the stemmer only as a "second chance" search.

I prefer my search results when the stemmer is disabled, but I also want to find
documents titled "forgives" when I search for "forgive".
Is there a way to do that with Elasticsearch?

Thanks a lot :)
Damien

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Use a multi field to index your field twice, once with and once without stemming. When searching, query both fields and boost the one without stemming. You will probably end up having to use query_string queries if you need to AND the words in your query, so that the search works across both fields for any permutation of stemmed and unstemmed words.
I also use the technique of putting both stemmed and unstemmed tokens into the same field, as opposed to two fields, but it has some pitfalls.
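
To make the two-field idea concrete, here is a minimal sketch that only builds the request bodies as Python dicts. It is not taken from the gist: the sub-field name "stemmed", the analyzer name "francais_exact" (the same chain without the stemmer) and the boost value are assumptions for illustration, and the mapping uses the newer "fields" syntax rather than the multi_field type available at the time of this thread.

```python
import json


def build_mapping():
    """Mapping sketch: index "name" twice, unstemmed and stemmed.

    Assumptions: "francais_exact" is a hypothetical analyzer without the
    stemmer, "francais" is the stemming analyzer from the question.
    """
    return {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "francais_exact",  # hypothetical, no stemmer
                "fields": {
                    # hypothetical sub-field holding the stemmed tokens
                    "stemmed": {"type": "text", "analyzer": "francais"}
                },
            }
        }
    }


def build_query(text, exact_boost=2.0):
    """Query both fields; the unstemmed field gets the higher boost."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": {"query": text, "boost": exact_boost}}},
                    {"match": {"name.stemmed": {"query": text}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("queens"), indent=2))
```

With this shape, a document containing the exact word can match both clauses, while a document that only matches after stemming matches just the second one, so it should tend to rank lower.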

I would like to query my documents against the _all field,
which is why in my example the "francais" analyzer is applied
to "_all".

I have tried multi_field, but I can't use it against the _all field. The
documentation says this:

The include_in_all setting on the “default” field allows to control if the
value of the field should be included in the _all field. Note, the value of
the field is copied to _all, not the tokens. So, it only makes sense to
copy the field value once. Because of this, the include_in_all setting on
all non-default fields is automatically set to false and can’t be changed.

Here is my test file: pony2.sh (GitHub)

It could work as expected if the stemmer also kept the
original word,
so that "Queens" became [Queen, Queens] and not just [Queen]. Is that
possible?
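
One way this might be approached is the keyword_repeat token filter, which emits each token twice (one copy protected from stemming), followed by the stemmer and a unique filter that drops the copy when stemming changed nothing. The sketch below only builds the index settings as a Python dict. The "standard" tokenizer and the filter name "unique_same_position" are assumptions, since the gist's full settings are not shown here; the other filter names come from the list in the first message.

```python
import json


def build_analysis_settings():
    """Analysis settings sketch: keep the original token next to its stem."""
    return {
        "analysis": {
            "filter": {
                # same stemmer as in the question
                "stemmer_fr": {"type": "stemmer", "language": "minimal_french"},
                # hypothetical name; removes the duplicate left behind when
                # the stem is identical to the original token
                "unique_same_position": {
                    "type": "unique",
                    "only_on_same_position": True,
                },
            },
            "analyzer": {
                "francais": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumption, not from the gist
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "elision",
                        "keyword_repeat",  # must come before the stemmer
                        "stemmer_fr",
                        "unique_same_position",
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(), indent=2))
```

The order matters: keyword_repeat has to run before the stemmer, and the unique filter after it. Whether the resulting scores are good enough would still need to be checked against real data.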

Thanks for your time :)

Damien

(In reply to the message from AlexR roytmana@gmail.com, 10 June 2013 19:55.)

I may have found a proper way to do it;
feel free to tell me whether it seems good to you too!

I removed stemming from the _all field,
but I apply it to the name field. And I query like this:

curl -XGET 'http://localhost:9200/pony_index_tmp/_search?pretty=true' -d '
{
  "query": {
    "multi_match": {
      "query": "queens",
      "fields": [ "_all^2", "name" ]
    }
  }
}'

And the results are better:

1: Gypsy Queens
2: Queens of the stone age
3: Queen

I have no idea how my stemmed "name" field is inserted into the _all
field.
Does it get stemmed? That is not clear to me at all.

Thank you,
happy coding!

(In reply to the message from Damien Alexandre dalexandre@jolicode.com, 11 June 2013 10:10.)

Two things:

  1. Is name the only field you need to stem? That sounds unlikely given that you are searching on _all. So you will likely end up with your own custom _all-like multi field with stemmed properties, which you can then combine with _all.

  2. The bigger issue is that multi_match will not work if you need to AND the words of your search phrase, because it will look for every word in every field, so it will be like searching on name alone. So you would need to use query_string.
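
As a rough sketch of the second point, a query_string body over the same two fields with AND as the default operator could look like the following. The helper name and the boost are illustrative only, and the field list simply mirrors the multi_match query earlier in the thread.

```python
import json


def build_and_query(text, exact_boost=2):
    """query_string sketch: every word must match in at least one field."""
    return {
        "query": {
            "query_string": {
                "query": text,
                # unstemmed _all is boosted, stemmed name is the fallback
                "fields": ["_all^%d" % exact_boost, "name"],
                "default_operator": "AND",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_and_query("queens stone"), indent=2))
```

Keep in mind that query_string parses its input as query syntax, so raw user input containing special characters may need escaping first.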