Querystring search: Tokens are out of order

Dave_Reed · April 14, 2015, 7:03pm

I have the following search:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "default_operator": "AND",
          "query": "details:foo\\-bar"
        }
      },
      "filter": {
        "term": {
          "deleted": false
        }
      }
    }
  }
}

The details field is analyzed using a pattern analyzer, like so:

settings: {
index.analysis.analyzer.letterordigit.pattern: "[^\p{L}\p{N}]+",
index.analysis.analyzer.letterordigit.type: "pattern"
}

This breaks the field into tokens separated by any non-letter or
non-numeric character.
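
As a rough illustration of that tokenization, here is a minimal Python sketch. It is not Elasticsearch code: Python's re module has no \p{L} or \p{N} classes, so [\W_]+ is used as an approximation, and the analyze helper is a hypothetical name. The lowercasing step assumes the pattern analyzer's default behaviour.

```python
import re

# Stand-in for the analyzer pattern [^\p{L}\p{N}]+ : Python's re module lacks
# \p{..} classes, so "non-word characters plus underscore" approximates it.
SEPARATORS = re.compile(r"[\W_]+")


def analyze(text, lowercase=True):
    """Split text on runs of separator characters and drop empty tokens."""
    tokens = [tok for tok in SEPARATORS.split(text) if tok]
    # Assumption: the pattern analyzer lowercases terms unless told otherwise.
    return [tok.lower() for tok in tokens] if lowercase else tokens


if __name__ == "__main__":
    print(analyze("foo-bar"))
    print(analyze("bar xyz foo"))
```

Running the same helper over both the indexed text and the search string shows why "foo-bar" ends up as two separate terms.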

But the user is searching for "foo-bar", which contains a non-alphanumeric
character. I assume, but correct me if I'm wrong, that ES will apply the
same analyzer to that string. So it is broken into two tokens: ["foo",
"bar"], and then the default_operator kicks in and essentially turns the
query into "details:foo AND details:bar".

My problem is that it will match documents containing "foo xyz bar" and
"bar xyz foo" -- in the latter case, the tokens are in the reverse order
from the user's search. I'm fine with it matching the former, but it's a
stretch to convince the user that the latter is intended.

The search string is provided by the user, so I can't really build a
complex query with different query types, hence the basic querystring
search.

Any advice or corrections to my assumptions are appreciated!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4a204214-f209-48dd-a13a-96463609ad7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

James_Macdonald · April 14, 2015, 8:14pm

Your analysis of what is going on sounds correct. However, Elasticsearch's
results are also correct. When it analyzes the search string, your query
becomes a match query on "foo" AND "bar", which matches any document
containing both of those terms. Most queries against analyzed fields do not
respect the original ordering of the terms.
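
A small Python sketch of why both documents match: an AND of terms only asks whether every term is present, so positions never enter into it. The matches_all_terms helper is a hypothetical stand-in for illustration, not how Lucene evaluates the query internally.

```python
def matches_all_terms(query_terms, doc_tokens):
    """Return True when every query term occurs somewhere in the document."""
    # Sets discard both order and position, which is the point being shown.
    return set(query_terms) <= set(doc_tokens)


if __name__ == "__main__":
    query = ["foo", "bar"]
    for text in ("foo xyz bar", "bar xyz foo", "foo xyz"):
        print(repr(text), matches_all_terms(query, text.split()))
```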

One thing you could try is looking into the match_phrase query (see
"Phrase Matching" in Elasticsearch: The Definitive Guide), which is aware
of the ordering of the terms. Using the base match_phrase query for
"foo bar" will not match either "foo xyz bar" or "bar xyz foo".
If you still need to match things like "foo xyz bar" you may be able to do
that using the slop parameter, depending on what exactly the use case is.
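
To make the idea concrete, here is a hedged Python sketch. The MATCH_PHRASE_BODY dict shows what a match_phrase request with slop might look like for the details field. The phrase_match function is a deliberately simplified, in-order-only model of position matching; it is not Lucene's actual sloppy phrase algorithm, which can also match reordered terms once slop is high enough.

```python
# Request body sketch: match_phrase on the "details" field with a slop of 1.
MATCH_PHRASE_BODY = {
    "query": {
        "match_phrase": {
            "details": {"query": "foo bar", "slop": 1}
        }
    }
}


def phrase_match(phrase_terms, doc_tokens, slop=0):
    """Simplified model: terms must appear in order, and the total number of
    extra tokens sitting between consecutive terms must not exceed slop."""
    positions = {}
    for index, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(index)

    def search(term_index, prev_pos, used):
        if term_index == len(phrase_terms):
            return True
        for pos in positions.get(phrase_terms[term_index], []):
            if prev_pos is None:
                if search(term_index + 1, pos, used):
                    return True
            elif pos > prev_pos:
                gap = pos - prev_pos - 1
                if used + gap <= slop and search(term_index + 1, pos, used + gap):
                    return True
        return False

    return search(0, None, 0)


if __name__ == "__main__":
    for text in ("foo bar", "foo xyz bar", "bar xyz foo"):
        tokens = text.split()
        print(repr(text),
              phrase_match(["foo", "bar"], tokens),
              phrase_match(["foo", "bar"], tokens, slop=1))
```

In this model a slop of 1 admits "foo xyz bar" while still rejecting "bar xyz foo", which is the behaviour asked for in the original post.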

James

Dave_Reed · April 14, 2015, 8:34pm

Thanks, though unless I am misunderstanding it, the docs imply otherwise:

For example, from the query string query documentation (Query string query | Elasticsearch Guide | Elastic):

The query string is parsed into a series of terms and operators. A term
can be a single word — quick or brown — or a phrase, surrounded by double
quotes — "quick brown" — which searches for all the words in the phrase,
in the same order.

So what gives?

Dave_Reed · April 14, 2015, 8:38pm

To perhaps answer my own question, I think I understand the difference.

details:"foo bar"

Would search for the tokens in the same order (implied by the docs I
referenced). But

details:foo-bar

Would not honor the order. The quotes have more meaning than to enclose the
phrase... if that is true then these two queries are not the same, which is
different than I thought:

details:foo\ bar
!=
details:"foo bar"

Or am I barking up the wrong tree...

Ivan · April 15, 2015, 8:32am

Your understanding is correct. The former will be translated into a Lucene
phrase query, which uses the term doc positions to find matches.

Both query terms are analyzed, but the latter will simply be a bag-of-words
query, which ignores positions.
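
Since the search string comes from the user, one possible way to get the phrase behaviour without switching query types is to wrap the input in double quotes before handing it to query_string. The sketch below assumes that approach. The quote_as_phrase and build_search helpers are hypothetical names, the request shape mirrors the filtered query from the original post, and phrase_slop is the query_string setting for the default slop on phrases.

```python
import json


def quote_as_phrase(user_input, field="details"):
    """Wrap raw user input in double quotes so it is parsed as one phrase."""
    # Escape backslashes first, then embedded double quotes, so the input
    # cannot end the phrase early.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def build_search(user_input, field="details", phrase_slop=0):
    """Build a request body shaped like the one in the original post."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {
                        "default_operator": "AND",
                        "phrase_slop": phrase_slop,
                        "query": quote_as_phrase(user_input, field),
                    }
                },
                "filter": {"term": {"deleted": False}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search("foo-bar", phrase_slop=1), indent=2))
```

The trade-off is that quoting everything turns off the rest of the query string syntax for that input, so this only fits cases where users are expected to type plain text.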

Cheers,

Ivan