Query_string queries and special characters

We've been using query_string queries with ElasticSearch as part of a quick
proof of concept in the company. While we have had pretty good success, we
have seen some things we don't understand as well. Hoping someone can
shed some light/confirm some suspicions.

We have a field (call it field1) in our content for which we defined a
mapping with a custom analyzer (keyword tokenizer, lower case filter). The
field has textual data (including some non-alphanum characters such as '-'
and '/'). An example might be: Fksdj-hfge/76543-89-0. Running the
_analyze endpoint shows that it is being treated as one token and has been
lowercased as expected. In the index, every document has a unique value in
this field. We also have a default all field set on the index.

When we submit a query like this: field1:Fksdj-hfge/76543-89-0 we get no
results.
When I escape the '/' like this: field1:Fksdj-hfge\/76543-89-0 it finds
the document.

Based on the results, I assume that unlike match queries, query_string
doesn't apply the analyzer from the field being searched to the query.
Assuming that is true, some questions:

  1. What analyzer is used by query_string queries to process the search
    string by default?
  2. Do you need to escape any special/non-alphanumeric character to get it
    to pass through the query parser (assuming we let it use its default
    analyzer)?
  3. I assume the analyzer parameter on the query_string query refers to the
    query parser's analyzer. Will the query_string query select the correct
    analyzer for the specified field once it gets past parsing the query?

Thanks,
Curt

PS: I know I can use term queries; however, we are trying to hook into an
existing system that provides Lucene syntax queries, and we are trying to
avoid the extra development for the proof of concept.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

query_string passes the query straight through to the Lucene parser, so it
behaves just like the Lucene QueryParser. Specifically, the parser will:

  1. Tokenize your query with its own tokenizer (not the one in your
    analyzer) to find tokens/phrases and special characters.
  2. Rewrite your query to use any special operations (fuzzy, wildcard,
    etc.).
  3. Pass the tokens/phrases to the field's analyzer for analysis.
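
To make the ordering concrete, here is a rough Python sketch of steps 1 and 3 for a single field:term clause. It is a toy model, not the real Lucene grammar: parse_field_term, unescape and analyze_keyword_lowercase are made-up names, and the analyzer function only imitates a keyword tokenizer plus lowercase filter like the one described in the question.

```python
def analyze_keyword_lowercase(text):
    """Stand-in for the field's analyzer: keyword tokenizer + lowercase filter."""
    # A keyword tokenizer emits the whole input as a single token.
    return [text.lower()]


def unescape(term):
    """Drop the backslash from every backslash-escaped character."""
    out = []
    chars = iter(term)
    for ch in chars:
        if ch == "\\":
            # Keep the escaped character itself as a literal.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


def parse_field_term(query):
    """Toy 'query parsing' phase: handles one `field:term` clause only."""
    field, _, raw_term = query.partition(":")
    return field, unescape(raw_term)


# Phase 1: the parser splits off the field and resolves the escapes.
field, term = parse_field_term(r"field1:Fksdj-hfge\/76543-89-0")

# Phase 3: only now does the field's analyzer see the text.
tokens = analyze_keyword_lowercase(term)

# What the same analyzer would have produced at index time.
indexed = analyze_keyword_lowercase("Fksdj-hfge/76543-89-0")

print(field, tokens, tokens == indexed)
```

In this model the escaped '/' survives parsing as a literal character, so the analyzed query term matches the single lowercased token stored in the index.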

So to answer your question, you'll have to escape special characters (a
full list can be found in the "Escaping Special Characters" section of the
Apache Lucene Query Parser Syntax documentation) so that the QueryParser
keeps them as part of the token. Once the query has made it through the
"query parsing" phase, the leftover tokens will be passed to the field's
analyzer (or whatever analyzer you specify in your query).
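
If the values come from another system, it may be easier to escape them programmatically before building the request. Below is a minimal Python sketch. The character set is my recollection of what Lucene's QueryParser.escape covers in recent versions, so check it against the list for the Lucene version you run. escape_lucene and build_query are hypothetical helper names, not part of any client library.

```python
import json

# Assumed set of query-parser special characters; verify for your version.
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(term):
    """Prefix every special character in `term` with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


def build_query(field, value):
    """Build a query_string request body for an exact value in one field."""
    return {
        "query": {
            "query_string": {
                # Only the value is escaped; the field name is used as given.
                "query": f"{field}:{escape_lucene(value)}"
            }
        }
    }


body = build_query("field1", "Fksdj-hfge/76543-89-0")

# json.dumps escapes each backslash again, as JSON requires.
print(json.dumps(body))
```

Note that the backslash has to be escaped a second time when the query string is embedded in JSON. Letting a JSON serializer do this avoids sending a body in which the escape is lost or invalid.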

-Zach

(In reply to Curt Kohler's message of Tuesday, August 27, 2013, 8:13:38 AM UTC-4.)