Query_string search containing a dash has unexpected results

I'm not using the standard analyzer. I'm using a pattern analyzer that breaks
the text on all non-word characters, like this:

"analyzer": {
"letterordigit": {
"type": "pattern",
"pattern": "[^\\p{L}\\p{N}]+"
}
}

I have verified that the message field is being broken up into the tokens I
expect (example in my first post).

So when I run a search for message:welcome-doesnotmatch, I'm expecting that
string to be broken into tokens like so:

welcome
doesnotmatch

And for the search to therefore find 0 documents. But it doesn't -- it
finds 1 document, the document that contains my sample message, which does
not include the token "doesnotmatch".
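
To make that expectation concrete, here is a rough Python sketch of the tokenization described above. It only models what I expected to happen, not what Elasticsearch actually does with the query. Assumptions: Python's built-in re module has no \p{L}/\p{N} classes, so [\W_]+ is used as an approximation of the analyzer pattern, and the tokens are lowercased to match the tokens listed in this thread. The names tokenize and matches_all are made up for illustration.

```python
import re

# Approximation of the "letterordigit" pattern analyzer: [\W_]+ (any run of
# characters that are not Unicode letters or digits) stands in for
# [^\p{L}\p{N}]+, which the built-in re module cannot express directly.
_SEPARATORS = re.compile(r"[\W_]+")


def tokenize(text):
    """Split on separator runs, lowercase, and drop empty tokens."""
    return [tok.lower() for tok in _SEPARATORS.split(text) if tok]


def matches_all(document_text, query_text):
    """True only if every query token is also a token of the document."""
    doc_tokens = set(tokenize(document_text))
    return all(tok in doc_tokens for tok in tokenize(query_text))


print(tokenize("Welcome to test.com!"))   # ['welcome', 'to', 'test', 'com']
print(tokenize("welcome-doesnotmatch"))   # ['welcome', 'doesnotmatch']
print(matches_all("Welcome to test.com!", "welcome-doesnotmatch"))  # False
```

Under this model the query should not match, which is why the 1 hit is surprising.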

So why on Earth would this search match that document? It is behaving as if
everything after the "-" is completely ignored. It does not matter what I
put there, it will still match the document.

This is coming up because an end user is searching for a hyphenated word,
like "battle-axe", and it's matching a document that does not contain the
word "axe" at all.

On Friday, November 7, 2014 12:24:30 AM UTC-8, Jun Ohtani wrote:

Hi Dave,

I think the reason is that your "message" field is using the standard analyzer.
The standard analyzer splits text on "-".
If you change the analyzer to the whitespace analyzer, the query matches 0 documents.

The _validate API is useful for checking the exact query that will be executed.
Example request:

curl -XGET "/YOUR_INDEX/_validate/query?explain" -d'
{
"query": {
"query_string": {
"query": "id:3955974 AND message:welcome-doesnotmatchanything"
}
}
}'

This is the kind of response you get. In this example, the "message" field is
mapped with "index": "not_analyzed".
{
"valid": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "YOUR_INDEX",
"valid": true,
"explanation": "+id:3955974 +message:welcome-doesnotmatchanything"
}
]
}
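
If you want to inspect that output from a script, a small sketch along these lines may help. It assumes the response has the shape shown above (a top-level "explanations" list whose entries carry "index" and "explanation"); the function name explanations is hypothetical, and the sample is the response from this post pasted in as a string rather than fetched from a cluster.

```python
import json


def explanations(validate_response):
    """Return (index, explanation) pairs from a _validate/query?explain response."""
    return [
        (entry["index"], entry["explanation"])
        for entry in validate_response.get("explanations", [])
    ]


# Sample response copied from this post, not fetched from a live cluster.
sample = json.loads("""
{
  "valid": true,
  "_shards": {"total": 1, "successful": 1, "failed": 0},
  "explanations": [
    {
      "index": "YOUR_INDEX",
      "valid": true,
      "explanation": "+id:3955974 +message:welcome-doesnotmatchanything"
    }
  ]
}
""")

for index, text in explanations(sample):
    print(index, "->", text)
```

The explanation string is the rewritten query, so it shows which terms the message clause was actually turned into.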

See:
Elasticsearch Platform — Find real-time answers at scale | Elastic

I hope that those help you out.

Regards,
Jun

2014-11-07 9:47 GMT+09:00 Dave Reed <infin...@gmail.com>:

I have a document with a field "message", that contains the following
text (truncated):

Welcome to test.com!

The message field is mapped to have an analyzer that breaks that string
into the following tokens:

welcome
to
test
com

But, when I search with a query like this:

{
"query": {
"query_string": {
"query": "id:3955974 AND message:welcome-doesnotmatchanything"
}
}
}

To my surprise, it finds the document (3955974 is the document id). The
dash and everything after it seem to be ignored, because no matter what I
put there, it still matches the document.

I've tried escaping it:

{
"query": {
"query_string": {
"query": "id:3955974 AND message:welcome\\-doesnotmatchanything"
}
}
}
(note the double escape since it has to be escaped for the JSON too)

But that makes no difference. I still get 1 matching document. If I put
it in quotes it works:

{
"query": {
"query_string": {
"query": "id:3955974 AND message:\"welcome-doesnotmatchanything\""
}
}
}

It works, meaning it matches 0 documents, since that document does not
contain the "doesnotmatchanything" token. That's great, but I don't
understand why the unquoted version does not work. This query is being
generated so I can't easily just decide to start quoting it, and I can't
always do that anyway since the user is sometimes going to use wildcards,
which can't be quoted if I want them to function. I was under the
assumption that an EscapedUnquotedString is the same as a quoted unescaped
string (in other words, foo:a\b\c === foo:"abc", assuming all special
characters are escaped in the unquoted version).
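
For reference, here is a rough Python sketch of the two escaping layers involved when a query like this is generated: first the query_string syntax, then JSON. It is not a fix for the behavior described here (escaping made no difference for me), only an illustration of how the escaped form is produced. Assumptions: the RESERVED set is my reading of the single-character operators in the query_string syntax and should be checked against the documentation for your version; escape_term, build_body and keep_wildcards are hypothetical names.

```python
import json

# Assumption: single-character operators that query_string treats as reserved.
# The two-character operators && and || are not handled by this sketch.
RESERVED = set('+-=!(){}[]^"~*?:\\/')


def escape_term(term, keep_wildcards=False):
    """Backslash-escape reserved characters, optionally leaving * and ? alone."""
    out = []
    for ch in term:
        if ch in RESERVED and not (keep_wildcards and ch in "*?"):
            out.append("\\")
        out.append(ch)
    return "".join(out)


def build_body(field, term, keep_wildcards=False):
    """Build the request body; json.dumps adds the second layer of escaping."""
    query = f"{field}:{escape_term(term, keep_wildcards)}"
    return json.dumps({"query": {"query_string": {"query": query}}})


print(build_body("message", "welcome-doesnotmatchanything"))
# {"query": {"query_string": {"query": "message:welcome\\-doesnotmatchanything"}}}
```

Letting the JSON library do the second layer avoids hand-counting backslashes, and keep_wildcards=True leaves * and ? usable for the wildcard case.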

I'm only on ES 1.01, but I don't see any changes in later versions that would
have impacted this behavior.

Any insights would be helpful! :)

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1dbfa1d5-7301-460b-ae9c-3665cfa79c96%40googlegroups.com
.
For more options, visit https://groups.google.com/d/optout.

--

Jun Ohtani
blog : http://blog.johtani.info
