Lang (czech) analyzer with asciifolding tokenizer or icu_tokenizer

hi

Is it possible to use a language analyzer together with the asciifolding
tokenizer or icu_tokenizer?
I've tried ICU, but it does not seem to support Czech. I found an attempt to
make it work at http://snowball.tartarus.org/otherapps/oregan/intro.html but
I do not know how to integrate this with ES to try it.

Basically, what I would like to happen is:
input string (1):
Přihláška
indexed strings (2):
přihláška
prihlaska

I do not care much about inflection ("word bending"). By that I mean that I
do not require reducing "Přihláška", "Přihlášky", ... to "prihlask".

Right now, the only solution I can think of is folding those characters
in the application.
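
To make the goal concrete, here is a minimal Python sketch of the desired behaviour, outside of Elasticsearch. It lowercases a word and strips diacritics via Unicode decomposition, which only approximates what a folding filter such as asciifolding does. The function names `fold_diacritics` and `index_forms` are made up for this illustration.

```python
import unicodedata


def fold_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word):
    """Return the lowercased word plus its folded form, if they differ."""
    lowered = word.lower()
    folded = fold_diacritics(lowered)
    return [lowered] if folded == lowered else [lowered, folded]


print(index_forms("Přihláška"))  # ['přihláška', 'prihlaska']
```

Doing this in the application would work, but it duplicates logic that an analysis chain inside the index can handle.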

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Thursday, April 18, 2013 3:50:19 PM UTC+2, gondo wrote:

> is it possible to use lang analyzer together with asciifolding tokenizer
> or icu_tokenizer?
> i've tried icu, but they do not support Czech.

I am not sure I understand what you are asking for. Can you use the ICU
filters/tokenizer together with "czech_stem" in Elasticsearch, which is
essentially the Czech analyzer?

simon

--

Hi,

On Thu, Apr 18, 2013 at 3:50 PM, gondo gondar@webdesigners.sk wrote:

> basically what i would like to happen is:
> input string (1):
> Přihláška
> indexed strings (2):
> přihláška
> prihlaska

The way I would do this is with a multi_field mapping, see
http://www.elasticsearch.org/guide/reference/mapping/multi-field-type/
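
As a rough sketch of the multi_field idea, the Python snippet below builds a mapping in which one field is indexed twice: once with the czech analyzer and once with a folding analyzer. The multi_field layout is assumed to follow the linked reference page for Elasticsearch versions of that era. The field name `title`, the sub-field name `folded`, and the analyzer name `czech_folded` are hypothetical, and the analyzer would still have to be defined in the index settings.

```python
import json


def multi_field_mapping(field, folded_analyzer):
    """Build a multi_field mapping with a default and a folded sub-field."""
    return {
        field: {
            "type": "multi_field",
            "fields": {
                # Default sub-field: same name as the field itself.
                field: {"type": "string", "analyzer": "czech"},
                # Extra sub-field, addressed as "<field>.folded" in queries.
                "folded": {"type": "string", "analyzer": folded_analyzer},
            },
        }
    }


mapping = multi_field_mapping("title", "czech_folded")
print(json.dumps(mapping, indent=2))
```

A query could then target `title` for accent-sensitive matching and `title.folded` for accent-insensitive matching.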

> i do not care about word bending too much. by that i mean, that i do not
> require transferring "Přihláška", "Přihlášky" .. into "prihlask"

AFAIK this is called stemming. So if you do not want stemming, do not use
the czech_stem filter or the czech analyzer.

For example, take this gist: https://gist.github.com/lukas-vlcek/4673027
Can you try modifying the analyzer called "cestina2" by replacing the
"czech_stemmer1" filter in it with "asciifolding"? Would that work for you?

Regards,
Lukas

--

Hi,
sorry, I guess my terminology isn't correct. I meant filters, not
tokenizers; I'm just starting with ES.
I'm quite happy with the ES language (czech) analyzer, but I would like to
add the asciifolding filter to it.
So in my example, I want "š" to be transformed into "s", "á" -> "a",
and another example:
"Ň" -> "n"
However, I was not able to extend the "czech" analyzer to add "asciifolding"
to it.
Unfortunately there is no documentation that says which tokenizers,
filters, ... are used in a language analyzer, so I could not just add the
"asciifolding" filter to the filter array.
By googling around, I found the ICU plugin
https://github.com/elasticsearch/elasticsearch-analysis-icu which provides
an "icu_normalizer" filter, and some German examples at
http://jprante.github.io/lessons/2012/05/16/multilingual-analysis-for-title-search.html
The problem with this is that Czech doesn't seem to be supported. (To add
to the confusion, the plugin mentions the language as "en", whereas the
German example uses "German2".)

--

@lukas:
Thanks, I've come across your example before, without success. I'll try to
modify it as you suggested.
Side question:
in this example
http://jprante.github.io/lessons/2012/05/16/multilingual-analysis-for-title-search.html
they achieved what I want (but for German, not Czech) by using the combo
analyzer https://github.com/yakaz/elasticsearch-analysis-combo/
but I was not able to make it work.
If I understand it correctly, multi_field is helpful just to preserve the
original token before any analysis, and you can't set up different filters.
What I need is:
filters: asciifolding + lowercase + the ones used in the czech analyzer (one
or multiple)
stop words: the ones used in the czech analyzer (one or multiple)
tokenizer: the one used in the czech analyzer (one or multiple)
stemmer: czech (this is optional)

So essentially I want to add the asciifolding (or icu_normalizer) filter to
the default czech analyzer.
Any idea?

On Friday, 19 April 2013 00:47:28 UTC+10, Lukáš Vlček wrote:

> For example, if you take this gist
> https://gist.github.com/lukas-vlcek/4673027 can you try to modify
> analyzer called "cestina2" and replace filter "czech_stemmer1" in it with
> "asciifolding"? Would that work for you?

--

I'm a little bit confused now, as I don't know the name of the filter
with the functionality I want: "Ň" -> "n"
I think I meant "icu_folding" rather than "icu_normalizer".

What's the difference between icu_normalizer and the built-in language
analyzer, apart from the fact that icu_normalizer apparently can't be set
to Czech?

The same question for icu_collation: what's the difference between
icu_collation and the built-in language analyzer?
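
The distinction between normalizing and folding can be illustrated with Python's standard `unicodedata` module. This is only an approximation of the ICU filters, not their actual implementation: `normalize_like` mimics NFKC normalization plus case folding (roughly what icu_normalizer does by default), while `fold_like` additionally strips combining marks (roughly the diacritic-removal part of icu_folding). Both function names are invented for this sketch.

```python
import unicodedata


def normalize_like(text):
    """NFKC normalization plus case folding; diacritics are kept."""
    return unicodedata.normalize("NFKC", text).casefold()


def fold_like(text):
    """Case folding, then decomposition and removal of combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_like("Ň"))  # ň
print(fold_like("Ň"))       # n
```

Neither step is language specific, which may be why no Czech setting is needed for the "Ň" -> "n" behaviour.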

--

@lukas: regarding your gist example, what are the exact differences
between the czech_stemmer1 and czech_stemmer2 filters? Can you give me an
example of input that produces different results with those two
filters? (They seem identical to me.)
By the way, after rereading your example a couple of times, I noticed your
last comment. That is very valuable information and it should be mentioned
in the documentation: "Note the custom analyzer is in fact the same to what
is preconfigured in cestina1 under the hood"

I think that what I want is the "cestina" analyzer in my code below; it
should be the "czech" analyzer + the "asciifolding" filter.
Question: are "cestina_1" and "cestina_2" valid analyzers?
And if you know, what are the differences between "asciifolding" and
"icu_folding" for Czech?

...
"analyzer" : {
  "cestina" : {
    "type" : "custom",
    "tokenizer" : "standard",
    "filter" : ["standard", "lowercase", "asciifolding",
                "czech_stop", "czech_stemmer"]
  },
  "cestina_1" : {
    "type" : "czech",
    "filter" : ["asciifolding"]
  },
  "desky_3" : {
    "type" : "czech",
    "filter" : ["icu_folding"]
  }
},
"filter" : {
  "czech_stemmer" : {
    "type" : "stemmer",
    "name" : "czech"
  },
  "czech_stemmer_2" : {
    "type" : "czech_stem"
  },
  "czech_stop" : {
    "type" : "stop",
    "stopwords" : ["_czech_"]
  }
}
...

thanks

--

Hi,

On Fri, Apr 19, 2013 at 7:32 AM, gondo gondar@webdesigners.sk wrote:

> @lukas: regards your gist example,what are the exact differences
> between czech_stemmer1 and czech_stemmer2 filters? can you give me some
> example what will produce different results after applying those two
> filters? (they seems to be identical to me)

There is no difference between the czech_stemmer1 and czech_stemmer2 filters
in the gist [https://gist.github.com/lukas-vlcek/4673027]; they both do
exactly the same thing. They are just "synonyms". I wanted to put together
all possible configurations and make a consolidated example. Both are valid
options, and as long as they are both supported in ES you can use either
the first or the second filter.

So if the cestina4 analyzer used the czech_stemmer1 filter instead of
czech_stemmer2, it would be the same as well.

> btw after rereading your example couple of times, i've noticed your last
> comment, that is very valuable information and it should be mentioned in
> documentation: "Note the custom analyzer is in fact the same to what is
> preconfigured in cestina1 under the hood"

Yes, that is the point. The difference between cestina1 and cestina2 (or
between cestina1 and cestina3) is that the latter analyzer does not exclude
stop words. So you have to add a stop filter to the filter chain, and that
is what cestina4 is intended to demonstrate.

> i think that what i want is "cestina" analyzer in my code below, it should
> be "czech" analyzer + "asciifolding" filter.

Yes, I think this is what you need to start with. Dig deeper only once you
realise that this is not enough and you need something more advanced. It
probably depends on your use case: for example, the article from Jörg that
you linked before discusses an advanced topic related to library catalog
data. If your use case is not that complicated, chances are you will be
fine with asciifolding and/or the czech stemmer, or even ngrams.

> question, are "cestina_1" and "cestina_2" valid analyzers?

The documentation for language analyzers [1] does not mention the
possibility of setting up custom filters (if it is possible, then it should
be documented). So, to be strictly pedantic, I would say they are "valid"
and "identical" (because the filter option is simply ignored).

[1]
http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/

> and if you know, what are the differences between "asciifolding" and
> "icu_folding" for czech?

I am not an expert on ICU, but it seems that the Java API of the ICU project
supports Czech. It might be a nice experiment to give it a try (for example,
so that you get the correct sort order for tokens with "ch" and "h").

As for asciifolding, I simply use it to get rid of diacritics: ů -> u,
ě -> e, ... For the Czech language this works pretty well, I think. But if
you are using asciifolding, I would put it after the stemmer in the filter
chain. The stemmer for Czech (if you want to use it) is rule based, and
chances are that some of the rules require diacritics (I would have to check
the source code, but generally there is no reason to put the stemmer after
asciifolding, IMO).
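
The ordering argument can be shown with a toy example in Python. The single suffix rule below is made up for illustration and is not taken from the actual Czech stemmer; the point is only that a rule keyed on an accented suffix stops matching once diacritics have been folded away. `fold_diacritics`, `toy_stem`, and the rule itself are assumptions of this sketch.

```python
import unicodedata

# Hypothetical rule: strip the accented plural ending "ách".
SUFFIX = unicodedata.normalize("NFC", "ách")


def fold_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Remove SUFFIX if the token ends with it and is long enough."""
    token = unicodedata.normalize("NFC", token)
    if len(token) > len(SUFFIX) + 1 and token.endswith(SUFFIX):
        return token[: -len(SUFFIX)]
    return token


word = "ženách"
print(fold_diacritics(toy_stem(word)))  # stem, then fold: zen
print(toy_stem(fold_diacritics(word)))  # fold, then stem: zenach
```

With folding applied first, the suffix rule never fires, so inflected forms of the same word may end up as different tokens.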

HTH,

Lukáš

--

@lukas: thanks for the clarification. I've ended up with these settings for
now and will revisit them after some time and with some more data:
"analysis" : {
  "analyzer" : {
    "default" : {
      "type" : "custom",
      "tokenizer" : "standard",
      "filter" : ["standard", "lowercase",
                  "czech_stop", "czech_stemmer", "asciifolding"]
    }
  },
  "filter" : {
    "czech_stemmer" : {
      "type" : "stemmer",
      "name" : "czech"
    },
    "czech_stop" : {
      "type" : "stop",
      "stopwords" : ["_czech_"]
    }
  }
}
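
For readers who want both the accented and the folded token in the index, as asked in the original post, the sketch below builds a variant of these settings in Python. It assumes an Elasticsearch version whose asciifolding filter accepts the `preserve_original` option and which resolves `_czech_` as the predefined Czech stop word list; neither is confirmed for the version used in this thread. The filter name `folding_keep_original` is hypothetical, and the `standard` token filter is left out because later Elasticsearch releases removed it.

```python
import json


def czech_folding_settings():
    """Analysis settings: lowercase, stop words, stemmer, then folding."""
    return {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Folding comes last so the stemmer still sees diacritics.
                    "filter": ["lowercase", "czech_stop", "czech_stemmer",
                               "folding_keep_original"],
                }
            },
            "filter": {
                "czech_stop": {"type": "stop", "stopwords": "_czech_"},
                "czech_stemmer": {"type": "stemmer", "name": "czech"},
                # Assumed option: emit the original token next to the folded one.
                "folding_keep_original": {"type": "asciifolding",
                                          "preserve_original": True},
            },
        }
    }


settings = czech_folding_settings()
print(json.dumps(settings, indent=2))
```

The printed JSON could be used as the settings body when creating an index; checking a few sample words against the index's analyze endpoint is a reasonable way to confirm the resulting tokens.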

--

To correct myself, for anyone who might be reading this: the default
analyzer should be named "default" and not "_default_".
