Russian words do not work with the synonym token filter

Hi.
I have a test index with these settings:
curl -XPOST 'http://localhost:9200/test_index' -d '
{
  "settings" : {
    "number_of_shards" : 5,
    "language" : "javascript",
    "analysis" : {
      "filter" : {
        "snowball_text" : {
          "type" : "snowball",
          "language" : "Russian"
        },
        "synonym" : {
          "type" : "synonym",
          "synonyms_path" : "synonym.txt"
        }
      },
      "analyzer" : {
        "search" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["snowball_text", "lowercase", "russian_morphology", "synonym"]
        }
      }
    }
  },
  "mappings" : {
    "test_type" : {
      "properties" : {
        "test" : {
          "type" : "string",
          "analyzer" : "search"
        },
        "description" : {
          "type" : "string",
          "analyzer" : "search"
        }
      }
    }
  }
}'

File synonym.txt:
продажа => купить
аренда => арендовать, сниму, снять
foo => foo bar, baz

English words work fine:
curl -XGET 'http://localhost:9200/test_index/_analyze?text=foo&analyzer=search&pretty=true'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "baz",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "bar",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "SYNONYM",
    "position" : 2
  } ]
}

But Russian words do not:
curl -XGET 'http://localhost:9200/test_index/_analyze?text=продажа&analyzer=search&pretty=true'
{
  "tokens" : [ {
    "token" : "タ",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "",
    "position" : 1
  }, {
    "token" : "ᄒ",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "",
    "position" : 2
  }, {
    "token" : "ᄡ",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "",
    "position" : 3
  }, {
    "token" : "ᄚ",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "",
    "position" : 4
  }, {
    "token" : "ᄊ",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "",
    "position" : 5
  }, {
    "token" : "ᄚ",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "",
    "position" : 6
  } ]
}

I can't understand what I'm doing wrong.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/481aacc3-d892-43e3-9024-65d84dcffe56%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Despite my name, I do not speak Russian. :) Please excuse my ignorance of
the Russian language while I attempt to debug.

Currently, the synonym token filter is applied after the other three
token filters: "snowball_text", "lowercase", and "russian_morphology". As a
result, the synonym mapping performs its key lookups on terms that have
already been stemmed and lowercased (I do not know what russian_morphology
does). Try moving your synonym filter before any stemming. Placing it after
lowercasing is fine, as long as your synonym map has lowercased values (or
you set ignore_case to true). In your example, foo/bar/baz are not changed
by stemming, so they work as is.
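
A minimal sketch of that reordering, written in Python only to print the JSON
body. It keeps the filter and analyzer names from the original post and moves
"synonym" directly after "lowercase", ahead of "snowball_text" and
"russian_morphology". Placing it before "russian_morphology" is an assumption,
since what that filter does is not established in this thread.

```python
import json

# Analysis settings copied from the original post; only the filter order changes.
analysis = {
    "filter": {
        "snowball_text": {"type": "snowball", "language": "Russian"},
        "synonym": {"type": "synonym", "synonyms_path": "synonym.txt"},
    },
    "analyzer": {
        "search": {
            "type": "custom",
            "tokenizer": "standard",
            # Assumed order: lowercase first, then synonyms, then stemming.
            "filter": ["lowercase", "synonym", "snowball_text", "russian_morphology"],
        }
    },
}

# ensure_ascii=False keeps Cyrillic readable if any is added to the settings.
settings_body = json.dumps(
    {"settings": {"analysis": analysis}}, indent=2, ensure_ascii=False
)
print(settings_body)
```

With this order the keys in synonym.txt are matched against lowercased but
unstemmed tokens, so they should be written in lowercase.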

Cheers,

Ivan

On Thu, Mar 6, 2014 at 2:39 AM, Владимир Руденко <mailvovana@gmail.com> wrote:

This topic is rather old, but I'll answer it to help others who run into a
similar problem.

The original poster used this request:
curl -XGET 'http://localhost:9200/test_index/_analyze?text=продажа&analyzer=search&pretty=true'

The mistake is that the Russian text was not URL-encoded.
The raw bytes were decoded incorrectly, which is why the response contains
Japanese- and Korean-looking characters instead of Cyrillic.

Always urlencode Russian letters.
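
As an illustration, here is a minimal Python sketch that percent-encodes the
query text with the standard library before building the URL. The host, index
and analyzer names are the ones from the original post; the analyze_url helper
is a hypothetical name used only for this example.

```python
from urllib.parse import quote


def analyze_url(text, index="test_index", analyzer="search",
                host="http://localhost:9200"):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    # quote() encodes non-ASCII characters as UTF-8 bytes by default;
    # safe="" also escapes "/" so the text cannot alter the URL structure.
    encoded = quote(text, safe="")
    return f"{host}/{index}/_analyze?text={encoded}&analyzer={analyzer}&pretty=true"


url = analyze_url("продажа")
print(url)
# http://localhost:9200/test_index/_analyze?text=%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0&analyzer=search&pretty=true
```

The printed URL can then be passed to curl in single quotes, in the same way
as the requests earlier in this thread.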

Cheers.

On Thursday, March 6, 2014 at 20:31:41 UTC+3, Ivan Brusic wrote:
