Configure the right analyzer

Hi all,

I'm having a hard time configuring ES correctly so that accented and
non-accented words are found the same way.

Here is a sequence anyone can execute:


delete index

curl -XDELETE 'http://localhost:9200/my_index'

create index

curl -XPOST 'http://localhost:9200/my_index'

check create ok

curl -XGET 'http://localhost:9200/my_index/_settings'

close index

curl -XPOST 'http://localhost:9200/my_index/_close'

update index settings

curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
"index.analysis.analyzer.default.type":"snowball",
"index.analysis.analyzer.default.tokenizer":"standard",
"index.analysis.analyzer.default.filter.0":"standard",
"index.analysis.analyzer.default.filter.1":"lowercase",
"index.analysis.analyzer.default.filter.2":"asciifolding",
"index.analysis.analyzer.default.filter.3":"french_stemmer",
"index.analysis.filter.french_stemmer.type":"stemmer",
"index.analysis.filter.french_stemmer.name":"light_french"
}'

open index

curl -XPOST 'http://localhost:9200/my_index/_open'

check update index settings

curl -XGET 'http://localhost:9200/my_index/_settings'

create type

curl -XPUT 'http://localhost:9200/my_index/my_type/_mapping' -d '{"my_type":{"properties":{"title":{"type":"string"},"reference":{"type":"string","index":"not_analyzed"}}}}'

check create type

curl -XGET 'http://localhost:9200/my_index/my_type/_mapping'

add data

curl -XPUT 'http://localhost:9200/my_index/my_type/1' -d '{"reference":"ADV-REF-00000001", "title":"Ingénieur Java"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/2' -d '{"reference":"ADV-REF-00000002", "title":"Conservateur documentaliste"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/3' -d '{"reference":"ADV-REF-00000003", "title":"Technicien qualité validation H/F"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/4' -d '{"reference":"ADV-REF-00000004", "title":"Valet de chambre"}'
curl -XPUT 'http://localhost:9200/my_index/my_type/5' -d '{"reference":"ADV-REF-00000005", "title":"Ingénieur PHP"}'

check add data

curl -XGET 'http://localhost:9200/my_index/my_type/1'

search data

curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:ingenieur"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:ingénieur"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:inge"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:ingé"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:Ingenieur"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:Ingénieur"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:Inge"}}}'
curl -XGET 'http://localhost:9200/my_index/my_type/_search' -d '{"query":{"query_string":{"analyze_wildcard":"true","query":"title:Ingé"}}}'

I configured snowball, asciifolding, french_stemmer on my index.

The queries should return 2 results, corresponding to _id=1 and _id=5.
The first and the fifth return 0 results.

Any hint on what I am doing wrong?

Thanks for your help,

--
Cordialement/Regards,

Louis GUEYE
linkedin http://fr.linkedin.com/in/louisgueye |
blog http://deepintojee.wordpress.com/ |
twitter http://twitter.com/#!/lgueye

Hi Louis,

If I'm not wrong, I think you need to change this statement:

curl -XPUT 'http://localhost:9200/my_index/my_type/_mapping' -d '{"my_type":{"properties":{"title":{"type":"string"},"reference":{"type":"string","index":"not_analyzed"}}}}'

so that it uses "index":"analyzed" (which is actually the default). That
means your data will be analyzed by the analyzers you have configured.

You can also create your own 'custom' analyzer instead of defining default
analyzers and filters for the whole index, so that you can reuse it
for specific fields/types. Here is an example of a custom analyzer
I created recently: ES - Adding new custom analyzer · GitHub
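
For illustration, a custom analyzer of that kind might be declared as in the sketch below. This is not the content of the linked gist: the analyzer name folding_french is made up for the example, the node address and index name are taken from the original post, and the filter chain simply reuses the filters Louis listed. As far as I know, a tokenizer/filter list is only honoured when the analyzer type is "custom"; built-in types such as "snowball" apply their own fixed chain.

```shell
# Sketch only: assumes a node on localhost:9200 and an existing index "my_index".
ES='http://localhost:9200'

# "folding_french" is an illustrative name. "type":"custom" is what makes
# the tokenizer and filter list below take effect.
ANALYSIS_SETTINGS='{
  "index": {
    "analysis": {
      "analyzer": {
        "folding_french": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "french_stemmer"]
        }
      },
      "filter": {
        "french_stemmer": { "type": "stemmer", "name": "light_french" }
      }
    }
  }
}'

# Close the index, update its analysis settings, then reopen it.
curl -s -XPOST "$ES/my_index/_close" \
  && curl -s -XPUT "$ES/my_index/_settings" -d "$ANALYSIS_SETTINGS" \
  && curl -s -XPOST "$ES/my_index/_open" \
  || echo "request failed (is the node running?)"
```

A field would then opt in by naming the analyzer in its mapping, for example "analyzer":"folding_french" on title. Documents indexed before the change keep their old tokens until they are indexed again.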

You can test your analyzers using the Analyze API, which analyzes a
given text and returns the tokens your analyzer generates.
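
A call of that kind might look like the following sketch. It assumes the Elasticsearch releases current at the time of this thread, where the text to analyze is passed as the raw request body (later releases expect a JSON body instead), and it reuses the node address and index name from the original post.

```shell
# Sketch only: assumes a node on localhost:9200 and the index "my_index".
ANALYZE_URL='http://localhost:9200/my_index/_analyze'

# With no analyzer parameter, the index's default analyzer is used.
# The text to analyze is sent as the request body.
curl -s -XGET "$ANALYZE_URL" -d 'Ingénieur Java' \
  || echo "request failed (is the node running?)"
```

If accent folding and lowercasing are in effect, the accented and unaccented spellings of a word should come back as the same token.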

Additionally, I think you could simplify your tests by defining your index
settings, type and mapping all together in the index creation request.
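
As a rough sketch of such a combined request (an assumption about its shape, written against the API as used in this thread), the body below carries both the analysis settings and the type mapping. Index, type and field names come from the original post; the filter chain is only an example.

```shell
# Sketch only: assumes a node on localhost:9200 and that "my_index" does not exist yet.
CREATE_BODY='{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding"]
          }
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": { "type": "string" },
        "reference": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# One request creates the index, its default analyzer and the type mapping.
curl -s -XPUT 'http://localhost:9200/my_index' -d "$CREATE_BODY" \
  || echo "request failed (is the node running?)"
```

This only helps for test setups or new indices; an index that already exists still needs the close/update/open sequence from the original post.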

Finally, I recommend the following links:

#Index creation and settings

#Mapping definitions for Core Types (which mentions the
analyzed/not_analyzed difference)

#Custom analyzers

Hope it helps!
Cheers,

Frederic

On Feb 3, 2:06 pm, louis gueye louis.gu...@gmail.com wrote:

[...]

Hi Frederic,

1 - The "not_analyzed" for the reference property is on purpose. Reference
is a term, so it doesn't need analyzing.
2 - I know I can specify all settings during index creation, but in
production you cannot afford to drop your index and re-create it. So I
wanted to do it at runtime. By the way, I'm also interested in any working
solution to that problem.
3 - I'll give the Analyze API a try.

Anyway, thank you for the reply.

2012/2/4 Frederic focampo.br@gmail.com

[...]

--
Cordialement/Regards,

Louis GUEYE
linkedin http://fr.linkedin.com/in/louisgueye |
blog http://deepintojee.wordpress.com/ |
twitter http://twitter.com/#!/lgueye

My fault, it seems I read the "not_analyzed" setting as belonging to
the title field instead of the "reference" one.

Of course you're right that recreating a production index is not an
option. Actually, adding a custom analyzer to an index already in
production is what I had to do last week, closing and reopening the
index as you know, and I don't think there is another way to do it yet.

In my case I configured the analyzer with asciifolding filtering (as
you can see in the Gist) and it worked perfectly for finding words
with accents and special characters (Spanish and Portuguese).

I'll reproduce your steps and let you know what I get. Anyway, someone
more experienced with ES could help you more effectively than I can :¬)

PS: Not sure if the queries are just for testing purposes, but keep in
mind that using wildcards as prefixes could generate very slow
queries (see the wildcard query documentation, wildcard-query.html).

On 6 feb, 06:10, louis gueye louis.gu...@gmail.com wrote:

[...]

Thank you, Frederic.

I had already been warned that wildcards can be slow, but I could not
manage to make my use cases (the ones I provided) pass.
I really want a wildcard-free solution but I haven't found one yet.

Anyway, thank you for your time.

Louis

2012/2/6 Frederic focampo.br@gmail.com

[...]

I've created a gist to explain the desired behaviour.

In the gist I never figured out the right index settings that support
"ignore accents" and "lowercase" at both the indexing and search phases.

Actually, when I analyze "Ingénieur Java" it is stored correctly:
{"tokens":[{"token":"ingenieur","start_offset":0,"end_offset":9,"type":"","position":1},{"token":"java","start_offset":10,"end_offset":14,"type":"","position":2}]}

But when I search with an accent I get no hits:
curl http://localhost:9200/my_index/my_type/_search?q=ingén*

I really need to get this feature working, otherwise my client will never
switch to Elasticsearch.

Thx for your help.

--
Cordialement/Regards,

Louis GUEYE
linkedin http://fr.linkedin.com/in/louisgueye |
blog http://deepintojee.wordpress.com/ |
twitter http://twitter.com/#!/lgueye

Where is the gist?

On Tuesday, February 7, 2012 at 4:53 PM, louis gueye wrote:

[...]

A classic :( sorry.

--
Cordialement/Regards,

Louis GUEYE
linkedin http://fr.linkedin.com/in/louisgueye |
blog http://deepintojee.wordpress.com/ |
twitter http://twitter.com/#!/lgueye

2012/2/7 Shay Banon kimchy@gmail.com

Where is the gist?

It happens because you don't encode the special character properly when you provide it in the URL. Here are samples that work: gist:1777975 · GitHub.
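
As an illustration of the encoding issue (this is not the content of the linked gist), the sketch below assumes a UTF-8 terminal, where "é" is the two bytes C3 A9 and is therefore written %C3%A9 in a URL. It shows the accent encoded by hand and, alternatively, curl's -G and --data-urlencode options doing the encoding. The node address, index and type are the ones from the thread.

```shell
# Sketch only: assumes a node on localhost:9200 and a UTF-8 terminal.
SEARCH_URL='http://localhost:9200/my_index/my_type/_search'

# Option 1: percent-encode the accent by hand ("é" is C3 A9 in UTF-8).
# Quoting the URL also keeps the shell from interpreting "?" and "*".
ENCODED_URL="$SEARCH_URL?q=ing%C3%A9n*"
curl -s "$ENCODED_URL" \
  || echo "request failed (is the node running?)"

# Option 2: let curl encode the value. -G moves the data into the query string
# of a GET request, and --data-urlencode encodes everything after "q=".
curl -s -G "$SEARCH_URL" --data-urlencode 'q=ingén*' \
  || echo "request failed (is the node running?)"
```

Sending the query in a JSON request body, as in the query_string examples from the original post, avoids URL encoding altogether.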

On Tuesday, February 7, 2012 at 9:30 PM, louis gueye wrote:

A classic :( sorry.

elasticsearch : dealing with case and accents · GitHub


Hi Shay,

Got it. What I understand is that *accents can't be interpreted without
wildcard analysis*...

I was hoping they could be, because wildcard analysis is known to be
potentially slow.

Anyway, thanks a lot for your time.

When I have more time, I'll run more tests on accents and provide feedback.

Note: %e9 and %25e9 lead to the same result; both are found.

--
Cordialement/Regards,

Louis GUEYE
linkedin http://fr.linkedin.com/in/louisgueye |
blog http://deepintojee.wordpress.com/ |
twitter http://twitter.com/#!/lgueye

2012/2/9 Shay Banon kimchy@gmail.com

It happens because you don't encode the special character properly when
you provide it in the URL. Here are samples that work:
gist:1777975 · GitHub.
