Issue when indexing French words with ^

Alexandre_Heimburger · August 30, 2011, 10:17am

Hey

I've put my configuration and query in a gist.

https://gist.github.com/alheim/211d97b1d7cd3eb1aeac

gistfile1.js

{
    "index" : { 
        "analysis" : { 
            "filter" : { 
              "my_ngram" : { 
                "max_gram" : 20, 
                "min_gram" : 2,
                "type" : "nGram"
              },  
              "my_snow" : {

(The file is truncated here; the full configuration is in the gist linked above.)

The context

I use a stemmer + n-gram filter chain to index the title field of my
documents. (It also gives me very fast autocompletion, by the way.)

It works great with French words containing é or è (e.g. I can search for
theorie and find documents indexed with théorie).

But it does not work with French words containing a circumflex (^).

I indexed the title "Pôle web 2.0", but I cannot find it using the term "pole".

It seems that the n-gram tokenizer does not recognize ^ as an accent.

Any ideas?

-Alex-

Alexandre_Heimburger · August 31, 2011, 12:15pm

No ideas, anyone?

Tomislav_Poljak · August 31, 2011, 2:42pm

Hi,
I don't think the n-gram tokenizer strips any accents. You need to use the
ASCII Folding Token Filter for this, in both index-time and query-time
analysis. I've altered your AC (auto-complete) analysis by adding the
"asciifolding" filter at both index and query time; see the gist
"stripping accents in auto-complete analysis". I've tested it with
"query": "pole" and it matches.

Hope this helps,

Tomislav
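
For readers who cannot open the gist, here is a minimal sketch of what index settings with accent folding might look like. It is not the content of Tomislav's gist. The analyzer names (example_index_analyzer, example_search_analyzer), the standard tokenizer and the lowercase filter are assumptions made for illustration; only my_ngram and its parameters come from the original configuration. The stemmer filter (my_snow) is left out because its definition is truncated in the original post.

```json
{
  "index": {
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "example_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "my_ngram"]
        },
        "example_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```

The point of the sketch is that "asciifolding" appears in both analyzers and runs before "my_ngram", so "Pôle" is folded to "pole" before the n-grams are generated, and the query term is folded the same way. Skipping the n-gram filter at search time is a common autocomplete setup, but it is an assumption here, not something stated in the thread.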

Alexandre_Heimburger · August 31, 2011, 4:15pm

Thanks a lot. I'll test it tomorrow morning and let you know.
