Problem with French stemmer

Alexandre_Heimburger · June 28, 2011, 6:53pm

Hi,

I am working on a "search as you type" box.

Many users are indexed.

For example, when I type: Fre

I should get Docteur Freud, Frédéric Durand, Jean Frérac... all the
names matching Fre / Fré / Frê, etc.

I tried to use a custom analyzer with a light French stemmer, but I can
only get Frédéric if I type Fré.

You will find the configuration in the associated gist.

Thanks for your help.

gist.github.com

https://gist.github.com/alheim/fffe05c43db6fb5f16c9

gistfile1.json

//elasticsearch.json

{
    "index" : { 
        "analysis" : { 
            "analyzer" : { 
                "lowercase_keyword" : { 
                    "type" : "custom",
                    "tokenizer" : "keyword",
                    "filter" : ["lowercase","my_stemmer"]

(This file has been truncated.)

rmuir · June 28, 2011, 7:31pm

Prefix queries do not go through the Lucene analyzer, so if this is
your use case for this field I would not recommend stemming it!

P.S. You might want to consider using something more efficient for
search-as-you-type: Lucene's wildcards are slow, and I think you would
find edge n-grams much faster.
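
To illustrate the edge n-gram suggestion, here is a minimal Python sketch of the idea. It is not Elasticsearch code and not the poster's configuration. It assumes each word is lowercased and accent-folded, then indexed under all of its prefixes, and that the typed text is folded the same way before a single exact lookup. The helper names (`fold`, `edge_ngrams`, `build_index`, `search`), the gram lengths, and the extra name "Alice Martin" are hypothetical, chosen only for illustration.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Lowercase and strip accents (a stand-in for lowercase + accent folding)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of a token, from min_gram to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(names):
    """Map every edge n-gram of every word to the set of names containing it."""
    index = defaultdict(set)
    for name in names:
        for word in fold(name).split():
            for gram in edge_ngrams(word):
                index[gram].add(name)
    return index


def search(index, typed):
    """Fold the typed text the same way, then do one exact lookup."""
    return sorted(index.get(fold(typed), set()))


# Hypothetical sample data based on the names in the question.
names = ["Docteur Freud", "Frédéric Durand", "Jean Frérac", "Alice Martin"]
index = build_index(names)

print(search(index, "Fre"))
print(search(index, "Fré"))
```

Because the prefixes are computed once at index time, each keystroke becomes an exact term lookup rather than a scan over the term dictionary, and no stemmer is involved. Folding the typed text with the same function is what lets Fre, Fré and Frê reach the same entries.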

Alexandre_Heimburger · June 29, 2011, 6:35am

Thanks, Robert. Can you explain how to use edge n-grams in my context?

Alexandre_Heimburger · June 29, 2011, 7:55am

OK. The n-gram filter works fine.

But I still have a problem with the stemmer.

If I index "élève":

if I search "elev", there is a result (I think "elev" is the stemmed
form of "élève")
if I search "élève", there is no result (weird)

Here is the new gist.

gist.github.com

https://gist.github.com/aheimburger/98fccd76d8b4b1b1f0b1

gistfile1.json

// Configuration
{
    "index" : {
        "analysis" : {
            "filter" : {
              "my_ngram" : {
                "max_gram" : 20,
                "min_gram" : 2,
                "type" : "nGram"
              },

(This file has been truncated.)

rmuir · June 29, 2011, 8:16am

Why do you still want the stemmer? It does not make sense to me for
autocomplete, so I would remove it!
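
The gist is truncated, so the actual cause of the "élève" behaviour cannot be confirmed from this thread. One possible explanation, sketched below in Python under stated assumptions, is a mismatch between index-time and query-time analysis: the index holds n-grams of the stemmed form, while the query text is looked up without the same processing. `toy_stem` is a crude hypothetical stand-in, not the real French stemmer, and the 2 to 20 gram range is taken from the visible part of the gist.

```python
import unicodedata


def fold(text):
    """Lowercase and strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Crude stand-in for a French stemmer (NOT the real algorithm):
    fold accents, then drop one trailing 'e'."""
    folded = fold(token)
    return folded[:-1] if folded.endswith("e") else folded


def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of the token between min_gram and max_gram characters."""
    grams = set()
    for size in range(min_gram, min(len(token), max_gram) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


# Index time: stem first, then n-gram the stemmed token.
stemmed_index = ngrams(toy_stem("élève"))

# Assumed failure mode: the query text is looked up as typed.
print("elev" in stemmed_index)   # the stem is itself an indexed gram
print("élève" in stemmed_index)  # the accented, unstemmed form was never indexed

# Without the stemmer: fold accents at index time AND at query time.
folded_index = ngrams(fold("élève"))
print(fold("élève") in folded_index)
print(fold("elev") in folded_index)
```

In this sketch, dropping the stemmer and applying the same accent folding on both sides lets "élève" and "elev" both match, which is in line with the advice to remove the stemmer for autocomplete.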
