I used to use the 'query_string' query type to run searches across multiple
fields (via the built-in dismax capability). However, the Lucene parsing of
the query phrase causes more harm than good for me, so I thought I would
move to the text query family (which only analyzes the search phrase and
does not parse it).
This works fine as long as I only have one field to search. With multiple
fields things become difficult, as the text query family is strictly
single-field. I understand that I would have to construct a boolean (and)
query over the terms, with a dismax across the fields for each term, to
achieve what 'query_string' is doing implicitly. However, this approach
would require analyzing the search phrase (to get its terms) before I can
construct the correct query. That would mean another roundtrip via the
analyze API, and I could not use the text query with the analyzed terms.
This does not seem right.
So my question is whether I am missing something, or whether there is a
mismatch in the capabilities of the text vs. query_string API?
Could the text query family be extended to support multiple fields?
On Wed, 2012-03-14 at 09:14 -0700, Jan Fiedler wrote:
[...]
Well, not sure. This would use the default operator of the text query
(which is 'OR'). I have not tested it but I would assume I would end up
hitting documents that only have one of the terms (in either name or
title). What I need is documents that have both terms (i.e. 'foo' and
'bar') in either the 'name' or the 'title' field.
On Wed, 2012-03-14 at 09:44 -0700, Jan Fiedler wrote:
[...]
I am pretty sure this is still not what I need and what query_string is
providing. The query will now insist that both terms (i.e. 'foo' and 'bar')
are present in a single field. It will not match documents that have 'foo'
in the 'name' field and 'bar' in the 'title' field. This is what I tried to
get at in my first post. You would need to parse the phrase 'foo bar' to
get the terms such that you could build a bool (and) query per term with
dis_max queries over fields. I bet this is what the 'query_string' is doing
internally when mapping to Lucene. It is just missing for the text query
(or the 'query_string' should have a mode to disable parsing such that it
only analyzes).
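
To make the difference between the two behaviours concrete, here is a
minimal sketch in plain Python (no Elasticsearch involved). The helper
names are hypothetical, and a lowercase whitespace split stands in for the
real field analyzer, which is an assumption made only for illustration.

```python
def terms_of(text):
    # Assumption: lowercase + whitespace split stands in for the analyzer.
    return text.lower().split()


def all_terms_in_one_field(doc, phrase, fields):
    """Per-field AND: some single field must contain every term."""
    terms = terms_of(phrase)
    return any(
        all(t in terms_of(doc.get(f, "")) for t in terms) for f in fields
    )


def each_term_in_some_field(doc, phrase, fields):
    """Per-term AND: every term must occur in at least one of the fields."""
    terms = terms_of(phrase)
    return all(
        any(t in terms_of(doc.get(f, "")) for f in fields) for t in terms
    )


doc = {"name": "foo", "title": "bar"}
fields = ["name", "title"]
print(all_terms_in_one_field(doc, "foo bar", fields))   # False
print(each_term_in_some_field(doc, "foo bar", fields))  # True
```

The first function models a text query with the 'and' operator run per
field; the second models the behaviour asked for here, where 'foo' may
match in 'name' and 'bar' in 'title'.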
On Wed, 2012-03-14 at 11:57 -0700, Jan Fiedler wrote:
[...]
OK, I misunderstood your previous email.
It may just be easier to sanitise the user input and use the query
string. This is what I do in my Perl module:
#===================================
sub filter_keywords {
#===================================
    local $_ = shift;

    # replace any run of characters outside the allowed set with a space
    s{[^[:alpha:][:digit:] \-+'"*@\._]+}{ }g;

    # nothing searchable left
    return '' unless /[[:alpha:][:digit:]]/;

    # drop the boolean operator words
    s/\s*\b(?:and|or|not)\b\s*/ /gi;

    # remove '-' that don't have spaces before them
    s/(?<! )-/ /g;

    # remove the spaces after a + or -
    s/([+-])\s+/$1/g;

    # remove + or - not followed by a letter, number or "
    s/[+-](?![[:alpha:][:digit:]"])/ /g;

    # remove * without 3 char prefix
    s/(?<![[:alpha:][:digit:]\-@\._]{3})\*/ /g;

    # ensure quotes are closed
    my $quotes = (tr/"//);
    if ( $quotes % 2 ) { $_ .= '"' }

    # trim leading and trailing whitespace
    s/^\s+//;
    s/\s+$//;
    return $_;
}
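
For illustration, a sanitised keyword string could then be handed to
query_string over several fields, roughly as sketched below. This is
Python rather than Perl and only builds the request body. The function
name is hypothetical, and the 'fields', 'default_operator' and
'use_dis_max' options are given as I understand them from the
query_string documentation, so check them against the version in use.

```python
import json


def query_string_request(cleaned_keywords, fields):
    """Build a search body from an already-sanitised keyword string."""
    return {
        "query": {
            "query_string": {
                "query": cleaned_keywords,
                "fields": list(fields),
                # require every term rather than any term
                "default_operator": "AND",
                # combine the per-field queries with dis_max
                "use_dis_max": True,
            }
        }
    }


print(json.dumps(query_string_request("foo bar", ["name", "title"]), indent=2))
```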
The reason why it's simpler to do this with the query_string with multiple
fields is that the query parser for query_string already breaks the words
on whitespace (to parse the relevant syntax), so effectively it's building
the dis_max around the queries generated by that query_string whitespace
tokenization (each one is also further analyzed). The text query simply
takes the whole text and analyzes it, generating the relevant query.
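
The structure discussed in this thread (one dis_max over the fields per
whitespace-separated word, all of them required) could be built on the
client side roughly as follows. This is a sketch under assumptions, not
the actual query_string implementation: the function name is
hypothetical, a plain whitespace split replaces the query parser, and the
text query body uses the {"text": {field: {"query": ...}}} form.

```python
import json


def multi_field_and_query(phrase, fields):
    """bool/must of one dis_max per word, each dis_max spanning all fields."""
    must = []
    for word in phrase.split():  # assumption: naive whitespace tokenization
        per_field = [{"text": {field: {"query": word}}} for field in fields]
        must.append({"dis_max": {"queries": per_field}})
    return {"bool": {"must": must}}


print(json.dumps(multi_field_and_query("foo bar", ["name", "title"]), indent=2))
```

Each word is still analyzed by the text query for its own field, so no
extra roundtrip to the analyze API is needed. The cost is that words are
split on whitespace before analysis, which is the same trade-off that
query_string makes.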