Italian language support and stemming


(Stefano Masini) #1

Hi,

I've been playing a bit with ElasticSearch and I'm wondering how good
the support for the Italian language is.

Here's what I've done:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "italian"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "name": "Busta"
}

PUT myindex/mytype/2
{
  "name": "Attacco"
}

Now, if I do a search like this:

GET myindex/mytype/_search
{
  "query": {
    "match": {
      "name": "attacchi"
    }
  }
}

I get 1 result, as expected: "attacchi" is the plural of "attacco".

But if I search for "buste" (which is the plural of "busta"), I get 0
results.

This seems to me a pretty basic example of stemming in Italian. How well is
this language supported in ElasticSearch?

Thanks!
Stefano

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

You can simply verify this behaviour yourself using the analyze API:

» curl 'localhost:9200/_analyze?analyzer=italian' -d 'attacchi'
{"tokens":[{"token":"attacc","start_offset":0,"end_offset":8,"type":"","position":1}]}
» curl 'localhost:9200/_analyze?analyzer=italian' -d 'attacco'
{"tokens":[{"token":"attacc","start_offset":0,"end_offset":7,"type":"","position":1}]}
» curl 'localhost:9200/_analyze?analyzer=italian' -d 'busta'
{"tokens":[{"token":"busta","start_offset":0,"end_offset":5,"type":"","position":1}]}
» curl 'localhost:9200/_analyze?analyzer=italian' -d 'buste'
{"tokens":[{"token":"buste","start_offset":0,"end_offset":5,"type":"","position":1}]}

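As you can see, the italian analyzer does not work as expected here: it
reduces "attacchi" and "attacco" to the same token, "attacc", but leaves
"busta" and "buste" unchanged, so they never match each other. Be aware
that most of these analyzers are algorithmic, and thus not very powerful
for complex languages or their irregular parts (I don't speak Italian, so
I can't comment on that specific quality).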
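
If you want to script that comparison, here is a minimal sketch in Python. It only parses the four responses pasted from the curl calls above and does not talk to a cluster; the helper names `tokens_of` and `same_tokens` are made up for illustration. The idea is that a match query can only find a document when the query text and the indexed text end up as the same token.

```python
import json

# Raw _analyze responses copied from the curl calls above, keyed by input word.
responses = {
    "attacchi": '{"tokens":[{"token":"attacc","start_offset":0,"end_offset":8,"type":"","position":1}]}',
    "attacco": '{"tokens":[{"token":"attacc","start_offset":0,"end_offset":7,"type":"","position":1}]}',
    "busta": '{"tokens":[{"token":"busta","start_offset":0,"end_offset":5,"type":"","position":1}]}',
    "buste": '{"tokens":[{"token":"buste","start_offset":0,"end_offset":5,"type":"","position":1}]}',
}


def tokens_of(raw):
    """Return the token strings contained in one _analyze response."""
    return [entry["token"] for entry in json.loads(raw)["tokens"]]


def same_tokens(word_a, word_b):
    """True if both words were analyzed to the same list of tokens."""
    return tokens_of(responses[word_a]) == tokens_of(responses[word_b])


# Query term first, indexed term second.
for query, indexed in [("attacchi", "attacco"), ("buste", "busta")]:
    print(query, indexed, same_tokens(query, indexed))
```

For the responses above this reports a match for attacchi/attacco and no match for buste/busta, which is exactly what the search results showed.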

--Alex


(Luca Cavanna) #3

You can try the hunspell stemmer, which is dictionary based. It is usually
better for European languages, but the result really depends on the quality
of the dictionary.
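
A rough sketch of what the index definition could look like, written in Python so the request body can be built and inspected before sending it. The names `it_hunspell` and `italian_hunspell` are placeholders I made up, and I am assuming an Italian dictionary (.aff and .dic files) has been installed under the node's hunspell config directory for the locale `it_IT`. The option names follow the hunspell token filter as I understand it, so please check them against the documentation for your version. Whether "buste" then reduces to "busta" depends entirely on the dictionary you install.

```python
import json


def build_index_body(locale="it_IT"):
    """Build an index body that swaps the italian analyzer for a hunspell one.

    All names below (it_hunspell, italian_hunspell) are illustrative, and the
    locale must match a dictionary directory that exists on every node.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Dictionary-based stemmer instead of the algorithmic one.
                    "it_hunspell": {
                        "type": "hunspell",
                        "locale": locale,
                        "dedup": True,
                    }
                },
                "analyzer": {
                    "italian_hunspell": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so dictionary lookups see "busta", not "Busta".
                        "filter": ["lowercase", "it_hunspell"],
                    }
                },
            }
        },
        # Same mapping as in the original question, pointing at the new analyzer.
        "mappings": {
            "mytype": {
                "properties": {
                    "name": {"type": "string", "analyzer": "italian_hunspell"}
                }
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON is meant to be used as the body of the PUT myindex request from the first post. After creating the index you can run the same analyze calls as above, with analyzer=italian_hunspell against that index, to see what the dictionary actually produces.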

Cheers

