QueryString


(k4Rla) #1

Hi, I have a problem with QueryString search in Cyrillic using Snowball.
When I search for a word with a trailing wildcard, Elasticsearch does not
return the documents that contain exactly that word; it returns only results
whose words are longer than the query word. For example, my query word is
Auto (in Russian, Авто) and I have documents with _source ['name'] => Auto,
['name'] => Automobile, and ['name'] => Automobile showroom. If I search for
'Auto*', Elasticsearch returns only "Automobile" and "Automobile showroom",
but not 'Auto'. When I use the English versions of these words, everything
works fine.

Here are my settings. (I use the Elastica client. The code looks simple, but
I can comment some rows if needed.)

public static $_elastica = array(
    'number_of_shards' => 4,
    'number_of_replicas' => 1,
    'analysis' => array(
        'analyzer' => array(
            'indexAnalyzer' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'mySnowball')
            ),
            'searchAnalyzer' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'mySnowball', 'stop')
            )
        ),
        'filter' => array(
            'mySnowball' => array(
                'type' => 'snowball',
                'language' => 'russian',
            ),
            'stop' => array(
                'type' => 'stop',
                'stopwords' => 'а,без,более,бы,был,была,были,было,быть,в,вам,вас,весь,во,вот,все,всего,всех,вы,где,да,даже,для,до,его,ее,если,есть,еще,же,за,здесь,и,из,или,им,их,к,как,ко,когда,кто,ли,либо,мне,может,мы,на,надо,наш,не,него,нее,нет,ни,них,но,ну,о,об,однако,он,она,они,оно,от,очень,по,под,при,с,со,так,также,такой,там,те,тем,то,того,тоже,той,только,том,ты,у,уже,хотя,чего,чей,чем,что,чтобы,чье,чья,эта,эти,это,я',
            )
        )
    )
);

// Here we apply the settings to the index
$index->create(self::$_elastica, true);

// And here is where we create the type mapping
$mapping = new Elastica_Type_Mapping($type, self::$_elastica);
$mapping->setProperties(self::$_elasticaMapping);
$mapping->setParam('index_analyzer', 'indexAnalyzer');
$mapping->setParam('search_analyzer', 'searchAnalyzer');
$mapping->send();

// This is the mapping of the searched field
'name' => array('type' => 'string', '_analyzer' => array('path' => 'mySnowball'))

// And here is the search query
$query = new Elastica_Query_QueryString('Авто*');
$query->setAnalyzer('searchAnalyzer');


(Igor Motov) #2

The Russian Snowball analyzer reduces the word "Авто" to the token "авт":

curl "localhost:9200/twitter/_analyze?analyzer=indexAnalyzer" -d "Авто"
{"tokens":[{"token":"авт","start_offset":0,"end_offset":4,"type":"","position":1}]}

Lucene doesn't apply the analyzer to wildcard queries; it only lowercases the
terms. So your query is translated into a Lucene wildcard query looking for
all terms that start with "авто". The indexed term "авт" doesn't start with
"авто", and that's why the record is not returned.
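
To make the mismatch concrete, here is a rough sketch in plain Python (not Elasticsearch or Lucene code). It only imitates what is described above: the wildcard term is lowercased but not stemmed, then compared as a prefix against the stemmed terms in the index. The function name is made up for illustration, and the two indexed terms are the stems reported by _analyze in this thread.

```python
def wildcard_prefix_matches(query, indexed_terms):
    """Imitate an unanalyzed trailing-wildcard query.

    The query is only lowercased (no stemming), the trailing '*' is
    dropped, and the rest is used as a prefix against the index terms.
    """
    prefix = query.rstrip("*").lower()
    return [term for term in indexed_terms if term.startswith(prefix)]


# Stems as reported by the _analyze API in this thread.
indexed_terms = ["авт", "автомобил"]

# The unstemmed prefix "авто" is longer than the stem "авт", so the
# document that contains exactly "Авто" cannot match.
print(wildcard_prefix_matches("Авто*", indexed_terms))

# If the prefix were stemmed first ("авт"), both terms would match.
print(wildcard_prefix_matches("авт*", indexed_terms))
```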


(k4Rla) #3

Thank you for the answer. Is there any way to resolve this? I need to query
for "Авто" and get the results that match "Авто*". One more question: when
I search for 'Автомобиль', it returns "Автомобильный туризм" but not
"Японские автомобили". Is that for the same reason?


(Igor Motov) #4

Yes, if you are using a recent version of elasticsearch, you can
specify analyze_wildcard:true in your query. (See
http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html)
I think it was added in 0.18.6.
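
For reference, a request body with this option might look roughly like the sketch below. It only builds and prints the JSON; how you send it (Elastica, curl, or another client) is up to you. The helper name build_query is hypothetical, and the analyzer name searchAnalyzer is taken from the settings posted earlier in this thread.

```python
import json


def build_query(text):
    """Build a query_string request body with wildcard analysis enabled."""
    return {
        "query": {
            "query_string": {
                "query": text,
                # Analyzer name taken from the index settings in this thread.
                "analyzer": "searchAnalyzer",
                # Ask Elasticsearch to analyze wildcard terms as well.
                "analyze_wildcard": True,
            }
        }
    }


print(json.dumps(build_query("Авто*"), ensure_ascii=False, indent=2))
```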

The second problem is caused by the way the Russian Snowball analyzer works.
As you can see below, "Автомобиль" is reduced to "автомобил",
"Автомобильный туризм" to "автомобильн" and "туризм", and "Японские
автомобили" to "японск" and "автомоб". The token "автомоб" does not start
with "автомобил", so that document is not matched.

$ curl "localhost:9200/twitter/_analyze?analyzer=indexAnalyzer" -d "Автомобиль"
{"tokens":[{"token":"автомобил","start_offset":0,"end_offset":10,"type":"","position":1}]}

$ curl "localhost:9200/twitter/_analyze?analyzer=indexAnalyzer" -d "Автомобильный туризм"
{"tokens":[{"token":"автомобильн","start_offset":0,"end_offset":13,"type":"","position":1},{"token":"туризм","start_offset":14,"end_offset":20,"type":"","position":2}]}

$ curl "localhost:9200/twitter/_analyze?analyzer=indexAnalyzer" -d "Японские автомобили"
{"tokens":[{"token":"японск","start_offset":0,"end_offset":8,"type":"","position":1},{"token":"автомоб","start_offset":9,"end_offset":19,"type":"","position":2}]}
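
The same comparison can be written out as a small Python sketch. It is only an illustration of how the three stems relate to each other, not how Elasticsearch evaluates the query. The tokens are copied from the _analyze output above; the helper names are made up.

```python
# Tokens per document title, copied from the _analyze output in this thread.
doc_tokens = {
    "Автомобиль": ["автомобил"],
    "Автомобильный туризм": ["автомобильн", "туризм"],
    "Японские автомобили": ["японск", "автомоб"],
}


def docs_with_exact_term(term, docs):
    """Titles that contain the term exactly."""
    return [title for title, tokens in docs.items() if term in tokens]


def docs_with_prefix(prefix, docs):
    """Titles that contain at least one token starting with the prefix."""
    return [
        title
        for title, tokens in docs.items()
        if any(token.startswith(prefix) for token in tokens)
    ]


query_stem = "автомобил"  # stem of "Автомобиль"

# An exact term match finds only the first title.
print(docs_with_exact_term(query_stem, doc_tokens))

# A prefix match also finds "Автомобильный туризм", but "автомоб" is
# shorter than the query stem, so "Японские автомобили" is still missed.
print(docs_with_prefix(query_stem, doc_tokens))
```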


(k4Rla) #5

Oh, I hadn't seen that analyze_wildcard was false, my bad. Now it works
perfectly, thank you very much. But what can I do about 'Автомобиль'? Does
the Russian Snowball stemmer have any morphology support?


(Igor Motov) #6

The Russian Snowball analyzer is basically
this: http://snowball.tartarus.org/algorithms/russian/stemmer.html In my
opinion, this rule-based approach is simply not adequate for a language as
morphologically complex as Russian. You need to look into something
dictionary-based. I am not working with Russian text at the moment and
cannot recommend any specific analyzer. You might want to take a look at
something like this: http://code.google.com/p/russianmorphology/ .
If it works for you, I can help you stitch it together with elasticsearch.


(k4Rla) #7

That's exactly what I need. It would be simply wonderful if you could help
me stitch this together with Elasticsearch!


(Igor Motov) #8

Here you go https://github.com/imotov/elasticsearch-analysis-morphology


(k4Rla) #9

Thank you very, very much. It's awesome and works exactly as I need!

On Wednesday, March 14, 2012 6:12:05 AM UTC+4, Igor Motov wrote:

Here you go https://github.com/imotov/elasticsearch-analysis-morphology

On Tuesday, March 13, 2012 2:32:31 PM UTC-4, k4Rla wrote:

Thats exactly what i need, it would be simply wonderful if you could help
me to stitch this with Elastic!

On Tuesday, March 13, 2012 9:38:39 PM UTC+4, Igor Motov wrote:

Russian Snowball Analyzer is basically this:
http://snowball.tartarus.org/algorithms/russian/stemmer.html In my
opinion, this rule-based approach is simply not adequate for such
morphologically complex language as Russian. You need to look into
something dictionary-based. I am not working with Russian text at the
moment and cannot recommend any specific analyzer. You might want to take a
look at something like this http://code.google.com/p/russianmorphology/. If it works for you, I can help you stitching it with elasticsearch.


(Павел Суслов) #10

Is it possible to tune this analyzer so that it handles dashed words? By
default it splits dashed words into parts.

On Wednesday, March 14, 2012 6:12:05 AM UTC+4, Igor Motov wrote:

Here you go https://github.com/imotov/elasticsearch-analysis-morphology



(Igor Motov) #11

The morphology analyzer uses the standard tokenizer with the standard,
lowercase, and morphology filters. In other words, it is defined internally
like this:

index.analysis.analyzer.russian_morphology:
    type: custom
    tokenizer: standard
    filter: standard, lowercase, russian_morphology

You can create your own analyzer that uses a different tokenizer. For
example, you can use the whitespace tokenizer instead:

index.analysis.analyzer.my_russian_morphology:
    type: custom
    tokenizer: whitespace
    filter: lowercase, russian_morphology

If you would like to keep the standard tokenizer's functionality but need to
process dashes differently, see the discussion
here: https://groups.google.com/d/topic/elasticsearch/eZJ7d4g71ZQ/discussion
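
To check how a custom analyzer like the one above treats dashed words, something along these lines could be tried. This is only a sketch under assumptions: the morphology plugin is installed so that the russian_morphology filter and analyzer are available under those names, morph_test is a made-up scratch index name, and "Санкт-Петербург" is just a sample word. It uses the same _analyze call style as earlier in this thread, which newer elasticsearch versions may not accept in this form.

```shell
# Create a scratch index (hypothetical name: morph_test) with the custom analyzer.
# Assumes the morphology plugin is installed and provides the russian_morphology filter.
curl -XPUT "localhost:9200/morph_test" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_russian_morphology": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "russian_morphology"]
        }
      }
    }
  }
}'

# Analyze a dashed sample word with the custom (whitespace-based) analyzer ...
curl "localhost:9200/morph_test/_analyze?analyzer=my_russian_morphology" -d "Санкт-Петербург"

# ... and with the plugin's built-in analyzer, to compare the resulting tokens.
curl "localhost:9200/morph_test/_analyze?analyzer=russian_morphology" -d "Санкт-Петербург"
```

Keep in mind that the whitespace tokenizer splits on whitespace only, so punctuation such as commas stays attached to the neighbouring word.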

On Wednesday, August 15, 2012 4:47:41 AM UTC-4, Павел Суслов wrote:

Is it possible to tune this analyzer so that it handles dashed words? By
default it splits dashed words into parts.



(system) #12