Problem with word-separators in bool search with standard tokenizer

In our search, text is configured with two analyzers, english and
standard, so that we can match phrases with the standard analyzer. We split
the keywords on spaces and build a bool query with one clause per word.

This works fine in all cases except when the query contains standard
word separators such as & (ampersand) or ; (semicolon). Because the
analyzer strips word separators at index time, searching for them
returns 0 results. Gist:
https://gist.github.com/ajhalani/3def3ea7caec5cd58490

I don't want to use a whitespace analyzer because we do want to ignore
word separators. I was thinking about hacky workarounds such as removing
all standalone non-alphanumeric characters, or moving them into "should"
clauses instead of the default "must" (in case we later add analyzers
that are whitespace-based).
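
To make the two workarounds concrete, here is a rough Python sketch of what I have in mind. The field name "text", the use of match clauses, and the helper name build_bool_query are placeholders for illustration only; they are not taken from the gist.

```python
import re


def build_bool_query(keywords, field="text", drop_separators=False):
    """Split keywords on whitespace and build one bool clause per word.

    Words without any letter or digit (e.g. "&" or ";") are treated as
    standalone separators: they are either dropped or moved to "should".
    The field name and the match clauses are assumptions for this sketch.
    """
    must, should = [], []
    for word in keywords.split():
        clause = {"match": {field: word}}
        if re.search(r"[^\W_]", word):
            # The word contains at least one alphanumeric character.
            must.append(clause)
        elif not drop_separators:
            # Standalone separator: optional, so it cannot zero out results.
            should.append(clause)
    return {"query": {"bool": {"must": must, "should": should}}}


if __name__ == "__main__":
    print(build_bool_query("barnes & noble"))
    print(build_bool_query("barnes & noble", drop_separators=True))
```

With drop_separators=True the "&" never reaches the query; otherwise it ends up in "should", so it only matters if some future analyzer keeps it as a token.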

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f2abbc24-52d5-4567-afa3-66610956ce0b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On the other hand, if I use a single query_string query instead of a bool
of terms, it works. Does ES/Lucene decide not to use the word separators by
looking at the definition of the fields?

Just checking back in case anyone has any ideas. Thanks!

The query_string query works because the query text is analyzed as well, so
the ampersand is also stripped from the query.

Your best bet is to use the pattern tokenizer and explicitly define which
characters to split the input text on.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
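
As a rough sketch of what that could look like, the snippet below builds index settings with a pattern tokenizer and previews the split locally. The separator set and the names separator_tokenizer and separator_analyzer are made up for illustration; adjust them to your data. The local preview uses Python's re module, which should agree with the tokenizer's Java regex for a simple character class like this one, but treat it as an approximation.

```python
import json
import re

# Hypothetical separator set: whitespace plus the characters to ignore.
SEPARATORS = r"[\s&;,]+"

# Index settings using the built-in "pattern" tokenizer, which splits the
# input wherever the pattern matches. Analyzer/tokenizer names are made up.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "separator_tokenizer": {
                    "type": "pattern",
                    "pattern": SEPARATORS,
                }
            },
            "analyzer": {
                "separator_analyzer": {
                    "type": "custom",
                    "tokenizer": "separator_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def preview_tokens(text):
    """Approximate locally what the analyzer above would emit."""
    return [token.lower() for token in re.split(SEPARATORS, text) if token]


if __name__ == "__main__":
    print(json.dumps(index_settings, indent=2))
    print(preview_tokens("Barnes & Noble; books"))
```

If the application splits the user's keywords with the same pattern before building the bool query, a standalone "&" never becomes a clause in the first place.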

Cheers,

Ivan

Hi Ankush,

A few weeks ago I released an Elasticsearch plugin that lets you
override the default word boundary properties for Unicode characters as
implemented by the StandardTokenizer algorithm. I had the same issue: I
wanted to use the StandardTokenizer but override the word boundary
properties for special characters like '#', '@', etc. (for example, treat
them the same way as '_', which is categorized as ExtendNumLet, i.e. an
extended num-letter).

Plugin: https://github.com/bbguitar77/elasticsearch-analysis-standardext

I hope this helps solve your issue.

Thanks
Bryan
