Problem with French stemmer


(Alexandre Heimburger) #1

Hi,

I'm working on a "search as you type" box.

Many users are indexed.

For example, when I type "Fre", I should get Docteur Freud, Frédéric Durand,
Jean Frérac, and so on: all the names matching Fre / Fré / Frê, etc.

I tried to use a custom analyzer with a light French stemmer, but I only
get Frédéric if I type Fré.

You will find the configuration in the associated gist.

Thanks for your help.


#2

Prefix queries do not go through the Lucene analyzer, so if prefix
matching is your use case for this field, I would not recommend stemming it!

P.S. You might want to consider something more efficient for
search-as-you-type: Lucene's wildcard queries are slow, and I think you
would find edge n-grams much faster.
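
To make the edge n-gram suggestion concrete, here is a minimal pure-Python sketch of the idea, not an Elasticsearch configuration. It assumes accents are folded and text is lowercased both at index time and at query time; the helper names (`fold`, `edge_ngrams`, `build_index`, `suggest`), the `max_gram` default, and the extra name "Marie Dupont" are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Lowercase and strip diacritics, e.g. 'Fré' -> 'fre'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10):
    """Yield the leading substrings of a token: 'fre' -> 'f', 'fr', 'fre'."""
    for size in range(min_gram, min(len(token), max_gram) + 1):
        yield token[:size]


def build_index(names):
    """Map every folded edge n-gram to the set of names that produce it."""
    index = defaultdict(set)
    for name in names:
        for word in name.split():
            for gram in edge_ngrams(fold(word)):
                index[gram].add(name)
    return index


def suggest(index, typed: str):
    """Fold the typed text the same way, then do one exact lookup."""
    return sorted(index.get(fold(typed), set()))


# Hypothetical sample data based on the names in the question.
names = ["Docteur Freud", "Frédéric Durand", "Jean Frérac", "Marie Dupont"]
index = build_index(names)

print(suggest(index, "Fre"))
print(suggest(index, "Fré"))
```

Because the prefixes are computed once at index time, each keystroke becomes a single exact-term lookup instead of a wildcard scan, which is the reason edge n-grams tend to be faster. Both calls above return the same three names, since "Fre" and "Fré" fold to the same key.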


(Alexandre Heimburger) #3

Thanks, Robert. Can you explain how to use edge n-grams in my context?

--
Alexandre Heimburger
R&D Manager
blueKiwi Software
tel : +33687880997
email : ahb@bluekiwi-software.com
address : 93 rue Vieille du Temple, 75003 Paris

What is blueKiwi? blueKiwi - the first Enterprise Social Software Suite in
the world building professional networks on conversations and relationships

  • helps large organizations increase their productivity, foster innovations
    and boost people satisfaction.

(Alexandre Heimburger) #4

OK, the n-gram filter works fine.

But I still have a problem with the stemmer.

If I index "élève":

  • if I search for "elev", there is a result (I think "elev" is the
    stemmed form of "élève")
  • if I search for "élève", there is no result (weird)

Here is the new gist.

#5

On Wed, Jun 29, 2011 at 3:55 AM, alheim alexheimburger@gmail.com wrote:

Here is the new gist.

https://gist.github.com/98fccd76d8b4b1b1f0b1

Why do you still want the stemmer? It does not make sense to me for
autocomplete, so I would remove it!
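
A toy Python sketch of one possible reading of the "élève" symptom, building on the earlier point that prefix queries are not analyzed. This is only a guess, since the gist's settings are not reproduced here. `toy_stem` is a made-up stand-in that drops a trailing "e" or "es"; it is not the French light stemmer, and `fold` and `prefix_match` are hypothetical helpers.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip diacritics, e.g. 'élève' -> 'eleve'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stemmer stand-in: drop a trailing 'es' or 'e'."""
    for suffix in ("es", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def prefix_match(indexed_token: str, query_text: str) -> bool:
    """A prefix query compares the query text as given with the indexed token."""
    return indexed_token.startswith(query_text)


stemmed_token = toy_stem(fold("élève"))  # index-time folding + stemming
folded_token = fold("élève")             # index-time folding only

# With stemming: only the stem itself matches.
print(prefix_match(stemmed_token, "elev"))         # True
print(prefix_match(stemmed_token, "élève"))        # False: query is not analyzed
print(prefix_match(stemmed_token, fold("élève")))  # False: 'eleve' is longer than the stem

# Without stemming: folding the query by hand is enough.
print(prefix_match(folded_token, fold("élève")))   # True
print(prefix_match(folded_token, fold("élè")))     # True
```

Under these assumptions, stemming shortens the indexed token below what the user actually types, so even an accent-folded query can miss it. Dropping the stemmer and keeping only lowercasing, accent folding, and edge n-grams avoids that mismatch.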

