Querystring search: Tokens are out of order

I have the following search:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "default_operator": "AND",
          "query": "details:foo\\-bar"
        }
      },
      "filter": {
        "term": {
          "deleted": false
        }
      }
    }
  }
}

The details field is analyzed with a pattern analyzer, configured like so:

settings: {
  index.analysis.analyzer.letterordigit.pattern: "[^\p{L}\p{N}]+",
  index.analysis.analyzer.letterordigit.type: "pattern"
}

This breaks the field into tokens separated by any non-letter or
non-numeric character.

But the user is searching for "foo-bar" which contains a non alphanumeric
character. I assume, but correct me if I'm wrong, that ES will apply the
same analyzer to that string. So it is broken into two tokens: ["foo",
"bar"], and then the default_operator kicks in and essentially turns the
query into "details:foo AND details:bar".
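
To make the assumed tokenization concrete, here is a minimal Python sketch of what the letterordigit analyzer is expected to do to the search string. It is an approximation, not Elasticsearch itself: Python's re module has no \p{L} or \p{N} classes, so [\W_]+ stands in for the pattern, and the lowercasing reflects my understanding that the pattern analyzer lowercases by default. The tokenize name is made up for this example.

```python
import re

# Stand-in for "[^\p{L}\p{N}]+": one or more characters that are neither
# letters nor digits. [\W_] is an approximation of that class in Python.
SEPARATORS = re.compile(r"[\W_]+")


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [tok for tok in SEPARATORS.split(text.lower()) if tok]


print(tokenize("foo-bar"))      # ['foo', 'bar']
print(tokenize("bar xyz foo"))  # ['bar', 'xyz', 'foo']
```

The hyphen never reaches the index or the query as a token, so nothing in the analyzed output records that "foo" came directly before "bar".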

My problem is that it will match documents containing "foo xyz bar" and
"bar xyz foo" -- in the latter case, the tokens are in the reverse order
from the user's search. I'm fine with it matching the former, but it's a
stretch to convince the user that the latter is intended.

The search string is provided by the user, so I can't really build a
complex query with different query types, hence the basic querystring
search.

Any advice or corrections to my assumptions are appreciated!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4a204214-f209-48dd-a13a-96463609ad7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Your analysis of what is going on sounds correct. However, Elasticsearch's
results are also correct. When it analyzes the search string, your query
becomes a match query on "foo" AND "bar", which matches any document
containing both of those terms. Most queries against analyzed fields do not
respect the original ordering of the terms.
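
The behaviour described above can be sketched in a few lines of Python. This is a toy model of AND semantics over analyzed tokens, not how Lucene evaluates the query; tokenize and matches_all_terms are hypothetical helpers, and the regular expression only approximates the letterordigit pattern.

```python
import re


def tokenize(text):
    # Approximation of the letterordigit analyzer: lowercase, then split on
    # anything that is not a letter or a digit.
    return [tok for tok in re.split(r"[\W_]+", text.lower()) if tok]


def matches_all_terms(query, document):
    """AND semantics: every query token must occur somewhere in the document."""
    return set(tokenize(query)) <= set(tokenize(document))


for doc in ["foo xyz bar", "bar xyz foo", "foo only"]:
    print(doc, "->", matches_all_terms("foo-bar", doc))
```

Because the comparison is between sets of tokens, "foo xyz bar" and "bar xyz foo" are indistinguishable to this kind of query.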

One thing you could try is the match_phrase query (see "Phrase Matching" in
Elasticsearch: The Definitive Guide),
which is aware of the ordering of the terms. Using the base match_phrase
query for "foo bar" will not match either "foo xyz bar" or "bar xyz foo".
If you still need to match things like "foo xyz bar" you may be able to do
that using the slop parameter, depending on what exactly the use case is.
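
As a rough illustration of that suggestion, the sketch below builds a request body that keeps the filtered wrapper and deleted filter from the original search but swaps query_string for match_phrase. The phrase_search_body helper is hypothetical, and the slop value of 1 is only an assumption about what "foo xyz bar" would need; the right value depends on the use case.

```python
import json


def phrase_search_body(field, text, slop=0):
    """Build a filtered query whose query part is a match_phrase on `field`."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "match_phrase": {
                        # slop is the number of position moves tolerated
                        # between the analyzed terms of the phrase.
                        field: {"query": text, "slop": slop}
                    }
                },
                "filter": {"term": {"deleted": False}},
            }
        }
    }


print(json.dumps(phrase_search_body("details", "foo-bar", slop=1), indent=2))
```

Note that this sends the user's text as one phrase, so any query_string operators the user typed would be treated as plain text.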

James

Thanks, though unless I am misunderstanding it, the docs imply otherwise:

For example, from the query string query page of the Elasticsearch Guide:

The query string is parsed into a series of terms and operators. A term
can be a single word — quick or brown — or a phrase, surrounded by double
quotes — "quick brown" — which searches for all the words in the phrase,
in the same order.

So what gives? :)

To perhaps answer my own question, I think I understand the difference.

details:"foo bar"

Would search for the tokens in the same order (implied by the docs I
referenced). But

details:foo-bar

Would not honor the order. The quotes have more meaning than to enclose the
phrase... if that is true then these two queries are not the same, which is
different than I thought:

details:foo\ bar
!=
details:"foo bar"

Or am I barking up the wrong tree...
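
If the quoted form is what is wanted, one option is to add the quotes on the user's behalf before handing the string to query_string. The sketch below is only an illustration under that assumption: as_phrase and query_string_body are made-up helper names, and default_field and phrase_slop are the query_string options I would expect to use here.

```python
import json


def as_phrase(user_input):
    """Wrap raw user input in double quotes so it is parsed as one phrase."""
    # Escape backslashes first, then embedded double quotes, so the user's
    # text cannot terminate the phrase early.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def query_string_body(field, user_input, phrase_slop=0):
    """Build a query_string clause that searches `field` for the whole phrase."""
    return {
        "query_string": {
            "default_field": field,
            "default_operator": "AND",
            "query": as_phrase(user_input),
            "phrase_slop": phrase_slop,
        }
    }


print(json.dumps(query_string_body("details", "foo-bar", phrase_slop=1)))
```

The trade-off is that quoting the entire input turns off any operators or field prefixes the user typed, so this only fits if the search box is meant to take plain text.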

Your understanding is correct. The former will be translated into a Lucene
phrase query, which uses the positions of the terms within each document to
find matches.

Both query strings are analyzed, but the latter simply becomes a bag-of-words
query, which ignores positions.
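
To show why positions make the difference, here is a simplified Python model of a positional phrase match. It is a sketch of the idea only: Lucene's real phrase scorer is more involved, the tokenize helper merely approximates the letterordigit analyzer, and the slop handling (the spread between how far each term has shifted from its place in the query) is my simplified reading of how slop behaves.

```python
import re
from itertools import product


def tokenize(text):
    # Approximation of the letterordigit analyzer.
    return [tok for tok in re.split(r"[\W_]+", text.lower()) if tok]


def phrase_match(query, document, slop=0):
    """True if the query tokens occur in order, within `slop` position moves."""
    q_tokens = tokenize(query)
    d_tokens = tokenize(document)
    # For each query token, every position at which it occurs in the document.
    positions = [[i for i, t in enumerate(d_tokens) if t == q] for q in q_tokens]
    if not q_tokens or any(not p for p in positions):
        return False
    for combo in product(*positions):
        # How far each term sits from where an exact phrase would place it.
        shifts = [pos - idx for idx, pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


print(phrase_match("foo-bar", "foo bar"))              # True
print(phrase_match("foo-bar", "foo xyz bar"))          # False
print(phrase_match("foo-bar", "foo xyz bar", slop=1))  # True
print(phrase_match("foo-bar", "bar xyz foo", slop=1))  # False
```

In this model a slop of 1 admits "foo xyz bar" while still rejecting the reversed "bar xyz foo", which is the behaviour asked for at the start of the thread.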

Cheers,

Ivan