Optional asciifolding


(Nik Everett) #1

I'm looking to make asciifolding optional in my (English) index. If the
user searches without any high ascii characters then I want to match
against the folded tokens. If the user searches with high ascii characters
then I only want to match the unfolded tokens. Is this possible with
Elasticsearch right now?

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd3w%2BNHJZQkcCRnEKuowAuObkBTVbHEhnCFpkLH7y0Pa0Q%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Itamar Syn-Hershko) #2

You will have to use two fields, or multiple terms at the same position. In a
recent project we found a nice way of dealing with that on the same field;
I hope to have a blog post about that soon.
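
For readers who want to picture the two-field option, here is a minimal sketch, not something posted in this thread. It assumes the mapping syntax of recent Elasticsearch releases (`text` fields with a `fields` sub-mapping); the field name `title`, the analyzer names `folded` and `unfolded`, and the helper functions are hypothetical. The client picks the field to query depending on whether the input contains non-ASCII characters.

```python
import json

# Hypothetical index body: "title" keeps unfolded tokens, "title.folded"
# holds the ASCII-folded tokens. Names are illustrative only.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "unfolded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
                "folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "unfolded",
                "fields": {"folded": {"type": "text", "analyzer": "folded"}},
            }
        }
    },
}


def has_non_ascii(text):
    """Return True if any character falls outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in text)


def build_query(user_input):
    """Query the unfolded field for non-ASCII input, the folded field otherwise."""
    field = "title" if has_non_ascii(user_input) else "title.folded"
    return {"query": {"match": {field: user_input}}}


if __name__ == "__main__":
    print(json.dumps(build_query("cafe")))
    print(json.dumps(build_query("café")))
```

The cost of this layout is that every document is analyzed and stored twice, once per field.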

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

On Tue, Jan 28, 2014 at 6:00 PM, Nikolas Everett nik9000@gmail.com wrote:

I'm looking to make asciifolding optional in my (English) index. If the
user searches without any high ascii characters then I want to match
against the folded tokens. If the user searches with high ascii characters
then I only want to match the unfolded tokens. Is this possible with
Elasticsearch right now?

Nik



(Nik Everett) #3

I'd prefer multiple terms in the same position if I can get away with it.
That way it'd all be configured by the analyzer so it wouldn't add any
extra complexity to other languages. It'd take up much less space that way
as well.
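
One way the analyzer-only setup could look is sketched below. This is an assumption rather than anything proposed in the thread: it relies on the `preserve_original` option of the `asciifolding` token filter, which is documented in later Elasticsearch releases and may not exist in the version discussed here. The names `folding_keep_original`, `index_folded`, `query_plain`, and `title` are hypothetical. The index analyzer emits both the folded and the original token at the same position, while the search analyzer only lowercases, so a query for "cafe" matches the folded token and a query for "café" matches only the original.

```python
import json

# Hypothetical index body. Only the filter type "asciifolding", its
# "preserve_original" option, and the "search_analyzer" mapping parameter
# are real Elasticsearch settings; every other name is illustrative.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "folding_keep_original": {
                    "type": "asciifolding",
                    "preserve_original": True,
                }
            },
            "analyzer": {
                # Index time: lowercase, then emit folded + original tokens.
                "index_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "folding_keep_original"],
                },
                # Search time: lowercase only, no folding.
                "query_plain": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "index_folded",
                "search_analyzer": "query_plain",
            }
        }
    },
}


def build_query(user_input):
    """The same match query serves both cases; the analyzers do the work."""
    return {"query": {"match": {"title": user_input}}}


if __name__ == "__main__":
    print(json.dumps(INDEX_BODY, indent=2))
```

Stemming is left out of this sketch; adding a stemmer to only one of the two analyzers would break the symmetry the approach depends on.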

On Tue, Jan 28, 2014 at 11:04 AM, Itamar Syn-Hershko itamar@code972.com wrote:

You will have to use 2 fields, or multiple terms on the same position. In
a recent project we found a nice way of dealing with that on the same
field, I hope to have a blog post about that soon..



(Itamar Syn-Hershko) #4

Ok so the idea is you store each term twice - once stemmed (+ ascii folded
+ whatever) and once just lowercased, and add a character (we used $) to
mark that term as the "original".

You can see it in action here:
https://github.com/synhershko/elasticsearch-analysis-hebrew/blob/master/src/main/java/com/code972/elasticsearch/analysis/HebrewIndexingAnalyzer.java#L20
(warning: plugin still under work, and is using some non-traditional
methods to do stuff)

There are some details to take into account, like how to search for the
original, but if you look at the code there you'll get an idea of
how it's done.

We did that also for non-Hebrew and non-English texts. It works quite
nicely, but it doubles the amount of terms in your index.
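
To make the marker idea concrete, here is a toy simulation in plain Python. It is a sketch under several assumptions and is not the linked plugin's code: the marker is appended as a suffix (the post does not say whether `$` is a prefix or suffix), stemming is skipped, folding is approximated with Unicode decomposition rather than Lucene's ASCII folding filter, and all function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

MARKER = "$"  # assumption: appended as a suffix to the "original" term


def fold(term):
    """Rough stand-in for ASCII folding: strip combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_for_index(text):
    """Emit two terms per position: the folded form and the marked original."""
    tokens = []
    for position, raw in enumerate(text.split()):
        lowered = raw.lower()
        tokens.append((position, fold(lowered)))
        tokens.append((position, lowered + MARKER))
    return tokens


def analyze_for_query(text):
    """Non-ASCII query words target the marked original; others the folded term."""
    terms = []
    for raw in text.split():
        lowered = raw.lower()
        if any(ord(ch) > 127 for ch in lowered):
            terms.append(lowered + MARKER)
        else:
            terms.append(lowered)
    return terms


def build_index(docs):
    """Build a term -> set of document ids inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _position, term in analyze_for_index(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    result = None
    for term in analyze_for_query(query):
        hits = index.get(term, set())
        result = set(hits) if result is None else result & hits
    return result if result is not None else set()


if __name__ == "__main__":
    docs = {1: "Café Roma", 2: "cafe roma"}
    index = build_index(docs)
    print(search(index, "cafe"))
    print(search(index, "café"))
```

In this toy index every position carries two terms, which is the doubling of terms mentioned above.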


On Tue, Jan 28, 2014 at 6:09 PM, Nikolas Everett nik9000@gmail.com wrote:

I'd prefer multiple terms in the same position if I can get away with it.
That way it'd all be configured by the analyzer so it wouldn't add any
extra complexity to other languages. It'd take up much less space that way
as well.


