Text query on multiple fields

I used to use the 'query_string' query type to run searches across multiple
fields (via the built-in dismax capability). However, the Lucene parsing of
the query phrase causes more harm than good for me, so I decided to move to
the text query family (which only analyzes the search phrase and does not
parse it).

This works fine as long as I only have one field to search. With multiple
fields things become difficult, as the text query family is strictly
single-field. I understand that I would have to construct a boolean (and)
query per term with a dismax per field to achieve what 'query_string' is
doing implicitly. However, this approach would require analyzing the search
phrase (to get its terms) before I can construct the correct query. That
analysis would mean another round trip via the analyze API, and I could not
use the text query with the analyzed terms. This does not seem right.

So my question is whether I am missing something, or whether there is a
mismatch in the capabilities of the text vs. query_string API.
Could the text query family be extended to support multiple fields?

Hi Jan

On Wed, 2012-03-14 at 09:14 -0700, Jan Fiedler wrote:

> I used to use the 'query_string' query type to run searches across
> multiple fields (via the built-in dismax capability). However, the
> Lucene parsing of the query phrase causes more harm than good for me
> so I thought to move to the text query family (that would only analyze
> but not parse the search phrase).
>
> This works fine as long as I only have one field to search. In case of
> multiple fields things become difficult as the text query family is
> strictly single fielded. I understand that I would have to construct a
> boolean (and) query per term with a dismax per field to achieve what
> 'query_string' is doing implicitly. However, this approach would
> require analyzing the search phrase (to get to its terms) before I can
> construct the correct query. Doing this analyze would mean another
> roundtrip via the analyze API and I could not use the text query with
> the analyzed terms. This does not seem right.

Doesn't this do what you want?

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
   "query" : {
      "dis_max" : {
         "queries" : [
            {
               "text" : {
                  "name" : "foo bar"
               }
            },
            {
               "text" : {
                  "title" : "foo bar"
               }
            }
         ]
      }
   }
}
'

clint

Well, not sure. This would use the default operator of the text query
(which is 'OR'). I have not tested it but I would assume I would end up
hitting documents that only have one of the terms (in either name or
title). What I need is documents that have both terms (i.e. 'foo' and
'bar') in either the 'name' or the 'title' field.

On Wed, 2012-03-14 at 09:44 -0700, Jan Fiedler wrote:

> Well, not sure. This would use the default operator of the text query
> (which is 'OR'). I have not tested it but I would assume I would end
> up hitting documents that only have one of the terms (in either name
> or title). What I need is documents that have both terms (i.e. 'foo'
> and 'bar') in either the 'name' or the 'title' field.

:wink:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
   "query" : {
      "dis_max" : {
         "queries" : [
            {
               "text" : {
                  "name" : {
                     "operator" : "and",
                     "query" : "foo bar"
                  }
               }
            },
            {
               "text" : {
                  "title" : {
                     "operator" : "and",
                     "query" : "foo bar"
                  }
               }
            }
         ]
      }
   }
}
'

I am pretty sure this is still not what I need and what query_string is
providing. The query will now insist that both terms (i.e. 'foo' and 'bar')
are present in a single field. It will not match documents that have 'foo'
in the 'name' field and 'bar' in the 'title' field. This is what I tried to
get at in my first post. You would need to parse the phrase 'foo bar' to
get the terms such that you could build a bool (and) query per term with
dis_max queries over fields. I bet this is what the 'query_string' is doing
internally when mapping to Lucene. It is just missing for the text query
(or the 'query_string' should have a mode to disable parsing such that it
only analyzes).
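
To make the difference between the two matching behaviours concrete, here is
a small in-memory sketch in Python. It does not talk to Elasticsearch at
all: the two helper functions are hypothetical, and "analysis" is reduced
to lowercasing and splitting on whitespace, which is only an assumption
for illustration.

```python
def field_terms(doc, field):
    """Very rough stand-in for analysis: lowercase and split on whitespace."""
    return doc.get(field, "").lower().split()


def matches_single_field(doc, terms, fields):
    """All terms must occur together in one field.

    This mirrors a dis_max over per-field text queries with operator 'and'.
    """
    return any(all(t in field_terms(doc, f) for t in terms) for f in fields)


def matches_cross_field(doc, terms, fields):
    """Every term must occur in at least one field, not necessarily the same.

    This mirrors a bool (and) over per-term dis_max queries across fields.
    """
    return all(any(t in field_terms(doc, f) for f in fields) for t in terms)


doc = {"name": "foo", "title": "bar"}
terms = ["foo", "bar"]
fields = ["name", "title"]

print(matches_single_field(doc, terms, fields))  # False
print(matches_cross_field(doc, terms, fields))   # True
```

The document with 'foo' in 'name' and 'bar' in 'title' is rejected by the
first check and accepted by the second, which is the case described above.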

On Wed, 2012-03-14 at 11:57 -0700, Jan Fiedler wrote:

> I am pretty sure this is still not what I need and what query_string
> is providing. The query will now insist that both terms (i.e. 'foo'
> and 'bar') are present in a single field. It will not match documents
> that have 'foo' in the 'name' field and 'bar' in the 'title' field.
> This is what I tried to get at in my first post. You would need to
> parse the phrase 'foo bar' to get the terms such that you could build
> a bool (and) query per term with dis_max queries over fields. I bet
> this is what the 'query_string' is doing internally when mapping to
> Lucene. It is just missing for the text query (or the 'query_string'
> should have a mode to disable parsing such that it only analyzes).

OK, I misunderstood your previous email.

It may just be easier to sanitise the user input and use the query
string. This is what I do in my Perl module:

https://metacpan.org/module/ElasticSearch::Util#filter_keywords-

#===================================
sub filter_keywords {
#===================================
    local $_ = shift;

    # replace runs of characters outside the allowed set with a space
    s{[^[:alpha:][:digit:] \-+'"*@\._]+}{ }g;

    # nothing searchable left: return an empty string
    return '' unless /[[:alpha:][:digit:]]/;

    # drop the boolean operator words and, or, not
    s/\s*\b(?:and|or|not)\b\s*/ /gi;

    # remove '-' that don't have spaces before them
    s/(?<! )-/\ /g;

    # remove the spaces after a + or -
    s/([+-])\s+/$1/g;

    # remove + or - not followed by a letter, number or "
    s/[+-](?![[:alpha:][:digit:]"])/ /g;

    # remove * without 3 char prefix
    s/(?<![[:alpha:][:digit:]\-@\._]{3})\*/ /g;

    # ensure quotes are closed
    my $quotes = (tr/"//);
    if ( $quotes % 2 ) { $_ .= '"' }

    # trim leading and trailing whitespace
    s/^\s+//;
    s/\s+$//;

    return $_;

}

clint
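
As a rough illustration of the "sanitise, then use query_string" route, the
Python sketch below builds a multi-field query_string request body. The
helper name build_query_string is hypothetical, the keywords are assumed to
have already been cleaned by something like filter_keywords above, and the
parameter names (fields, default_operator, use_dis_max) are the
query_string options as I remember them from that era, so check them
against the documentation for your version.

```python
import json


def build_query_string(keywords, fields):
    """Build a query_string search body over several fields.

    Assumes `keywords` has already been sanitised by the caller.
    """
    return {
        "query": {
            "query_string": {
                "query": keywords,
                "fields": list(fields),
                # require every term, instead of the default OR
                "default_operator": "AND",
                # combine the per-field queries with dis_max
                "use_dis_max": True,
            }
        }
    }


body = build_query_string("foo bar", ["name", "title"])
print(json.dumps(body, indent=3))
```

The printed JSON can be passed to curl with -d in the same way as the
earlier examples.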

The reason it's simpler to do this with query_string over multiple fields
is that the query parser for query_string already breaks the words on
whitespace (to parse the relevant syntax). Effectively, it builds the
dis_max around the queries generated by that whitespace tokenization (each
of which is then further analyzed). The text query simply takes the whole
text and analyzes it, generating the relevant query.
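
The same idea can be approximated on the client side without a round trip
to the analyze API: split the phrase on whitespace, build one dis_max of
per-field text queries for each token, and require all of them in a bool
query. This is a minimal Python sketch under those assumptions. The helper
name build_multi_field_query is hypothetical, whitespace splitting is only
an approximation of what the query_string parser does, and each token is
still analyzed by Elasticsearch inside its text query.

```python
import json


def build_multi_field_query(phrase, fields):
    """Build a bool (must) of per-token dis_max queries across `fields`."""
    tokens = phrase.split()
    if not tokens:
        # nothing to search for: fall back to matching everything
        return {"query": {"match_all": {}}}
    must = [
        {
            "dis_max": {
                # one single-field text query per field for this token
                "queries": [{"text": {field: token}} for field in fields]
            }
        }
        for token in tokens
    ]
    return {"query": {"bool": {"must": must}}}


body = build_multi_field_query("foo bar", ["name", "title"])
print(json.dumps(body, indent=3))
```

With this shape a document that has 'foo' in 'name' and 'bar' in 'title'
should match, because each token only has to be found in one of the fields.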

On Wed, Mar 14, 2012 at 8:57 PM, Jan Fiedler fiedler.jan@gmail.com wrote:

> I am pretty sure this is still not what I need and what query_string is
> providing. The query will now insist that both terms (i.e. 'foo' and 'bar')
> are present in a single field. It will not match documents that have 'foo'
> in the 'name' field and 'bar' in the 'title' field. This is what I tried to
> get at in my first post. You would need to parse the phrase 'foo bar' to
> get the terms such that you could build a bool (and) query per term with
> dis_max queries over fields. I bet this is what the 'query_string' is doing
> internally when mapping to Lucene. It is just missing for the text query
> (or the 'query_string' should have a mode to disable parsing such that it
> only analyzes).