Problem with word-separators in bool search with standard tokenizer

Ankush_Jhalani · September 19, 2014, 3:05pm

In our search we have configured the text field with two analyzers, english
and standard, so that we can match phrases against the standard-analyzed
version. We split the keywords on spaces and build a bool query with one
clause per word.

This works fine in all cases except when the query contains standard
word separators such as & (ampersand) or ; (semicolon). Because the
analyzer strips word separators at index time, searching for them
returns 0 results. Gist:


https://gist.github.com/ajhalani/3def3ea7caec5cd58490

bool search - word separator issue

POST testindex
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "document": {
      "dynamic": "strict",
      "_all": {
        "enabled": false

(The gist preview is truncated here; the full file is at the link above.)

I don't want to use a whitespace analyzer because we actually do want to
ignore word separators. I was thinking about hacky workarounds, such as
removing all standalone non-alphanumeric tokens from the query, or moving
them into "should" clauses instead of the default "must" (in case we add
whitespace-based analyzers in the future).
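
The first workaround could be sketched roughly as below. This is only an illustration, not the code from the gist: the function name build_bool_query, the field name "text", and the choice of one match clause per word are all assumptions made for the example.

```python
import json


def build_bool_query(keywords, field="text"):
    """Build a bool query with one "must" clause per keyword.

    `field` is a hypothetical field name; substitute your own mapping.
    Words with no letters or digits (for example a lone "&" or ";")
    are dropped before the clauses are built.
    """
    # Split on whitespace, as described in the post.
    words = keywords.split()
    # Keep only words that contain at least one alphanumeric character.
    words = [w for w in words if any(ch.isalnum() for ch in w)]
    # One match clause per remaining word (assumed clause type).
    must = [{"match": {field: w}} for w in words]
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    print(json.dumps(build_bool_query("black & white"), indent=2))
```

Note that if every word is dropped, the "must" list ends up empty, so the caller would need to decide what an all-separator query should return.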

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f2abbc24-52d5-4567-afa3-66610956ce0b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ankush_Jhalani · September 19, 2014, 3:12pm

On the other hand, if I use a single query_string instead of a bool of
terms, it works. Does ES/Lucene decide not to use the word separators by
looking at the definition of the fields?

Ankush_Jhalani · September 22, 2014, 4:19pm

Just checking back to see if anyone has any ideas. Thanks!

Ivan · September 23, 2014, 3:33am

The query string query works because it analyzes the query text, so the
ampersand is stripped from the query as well.

Your best bet is to use the pattern tokenizer and explicitly define which
characters to split the input text on.
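
For illustration, index settings using the pattern tokenizer might look something like the sketch below. The split pattern, the names separator_tokenizer and separator_analyzer, and the lowercase filter are assumptions chosen for the example. The approximate_tokens helper is only a rough local stand-in for what such an analyzer would emit, not the analyzer itself.

```python
import json
import re

# Hypothetical split pattern: whitespace plus the separators mentioned
# in this thread. Extend the character class with whatever you need.
SPLIT_PATTERN = r"[\s&;]+"

# Index settings defining a custom analyzer on top of the "pattern"
# tokenizer. The tokenizer and analyzer names are made up for this sketch.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "separator_tokenizer": {
                    "type": "pattern",
                    "pattern": SPLIT_PATTERN,
                }
            },
            "analyzer": {
                "separator_analyzer": {
                    "type": "custom",
                    "tokenizer": "separator_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def approximate_tokens(text):
    """Rough local approximation: split on the pattern, lowercase, drop empties."""
    return [t.lower() for t in re.split(SPLIT_PATTERN, text) if t]


if __name__ == "__main__":
    print(json.dumps(settings, indent=2))
    print(approximate_tokens("Black & White; Grey"))
```

Because the split characters are listed explicitly, anything left out of the pattern stays inside the token, so the same analyzer has to be applied to the query text as well.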

Cheers,

Ivan

Bryan_Warner · September 23, 2014, 2:03pm

Hi Ankush,

A few weeks ago I released an Elasticsearch plugin that allows you to
override the default word boundary properties for Unicode characters as
implemented by the StandardTokenizer algorithm. I had the same issue where
I wanted to use the StandardTokenizer but override the word boundary
properties for special characters like '#', '@', etc. (for example, to
treat them the same way as '_', which is categorized as an extended
num-letter).

Plugin: GitHub - bbguitar77/elasticsearch-analysis-standardext

I hope this helps solve your issue.

Thanks
Bryan
