For context: lukasvlcek had the conversation below in IRC, then left.
I'm answering him here.
lukasvlcek:
kimchy: I haven't been thinking about it before... what is the
rationale of not allowing analyzer setup for term query when
Query DSL is used? See
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-way-to-search-terms-lower-cased-tp932996.html
I am just curious why user has to search -exact- terms (Lower vs
Upper case)
sam_:
the default analyzer if nothing is specified is standard isn't
it?
lukasvlcek:
I did not try this particular example but I am confused by the
term query doc which explicitly says "not analyzed" (so even the
default analyzer is not used?)
sam_:
if it is not analyzed then I would suspect you need to provide
case
an exact match
the standard analyzer would result in it being converted
lukasvlcek:
wouldn't it be useful to have ability to specify analyzer?
sam_:
you can
well
at least when you define the mappings
the analyzer is used as part of the indexing
as an alternative I would think you could provide your own
parser implementation to which is what I'm trying to do
but have been unsuccessful
lukasvlcek:
but the point is if it is possible to specify analyzer when
querying via URL parameters then why can not specify analyzer
while using Query DSL
Gotta go now... but I would appreciate if anybody (kimchy?) can
follow up on that mail thread above (want to check that later)
----------------------------------------------
Answer:
(Note - this is as I understand the situation - I'm open to correction)
All data indexed in ElasticSearch/Lucene is stored as 'terms', which are
atomic: a term can't be broken down further.
So if you index {"text": "The quick brown fox jumped over the LAZY dog"}
then the default analyzer would:
- remove stopwords
- lowercase all text
- split on whitespace and punctuation
- result in these terms:
'quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog'
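To make the analysis steps concrete, here is a rough sketch in Python. It is not the real standard analyzer: toy_analyze and its tiny stopword list are made up for illustration, and the real analyzer's tokenizing rules and stopword list are more involved.

```python
import re

# Hypothetical, shortened stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}

def toy_analyze(text):
    """Lowercase, split on anything that is not a letter or digit, drop stopwords."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOPWORDS]

terms = toy_analyze("The quick brown fox jumped over the LAZY dog")
print(terms)
```

The list it prints is what ends up in the index; the original sentence, with its capitals and stopwords, is not what gets searched.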
If you then do this search:
{ "query_string": { "query": "QUICK dOg"}}
Then the default analyzer would analyze your query string and return the
following terms: "quick", "dog"
It then does a 'term' query for each of those terms and combines the
results.
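Here is how I picture that, as a toy inverted index in Python. All the names (toy_analyze, build_index, term_query, query_string) are invented for this sketch, and I'm assuming the per-term results are combined with OR; the real thing also scores the results, which this ignores.

```python
from collections import defaultdict

STOPWORDS = {"the"}  # assumption: a minimal stopword list for this sketch

def toy_analyze(text):
    # Stand-in for the default analyzer: lowercase, split on whitespace, drop stopwords.
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in toy_analyze(text):
            index[term].add(doc_id)
    return index

def term_query(index, term):
    # Exact lookup: the term is NOT analyzed before being looked up.
    return index.get(term, set())

def query_string(index, query):
    # Analyze the query string, then run one term query per resulting term.
    result = set()
    for term in toy_analyze(query):
        result |= term_query(index, term)
    return result

docs = {1: "The quick brown fox jumped over the LAZY dog", 2: "a slow cat"}
index = build_index(docs)
print(query_string(index, "QUICK dOg"))  # analyzed first, so it finds doc 1
print(term_query(index, "QUICK"))        # not analyzed, so nothing matches
```

The second lookup comes back empty because the index only ever contains the lowercased term "quick".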
If you did this search:
{ "wildcard": {"text": "*o*"}}
Then it would first look at all terms, and find only those terms that
match that pattern, ie: 'brown', 'fox', 'over', 'dog'.
It then does a 'term' query for each of those terms and combines the
results.
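The term expansion can be sketched like this, assuming the pattern is meant to match any term containing an 'o' (written here as *o*). I'm using Python's fnmatch as a stand-in for the wildcard matching, and expand_wildcard is a made-up helper, not anything from ElasticSearch.

```python
from fnmatch import fnmatchcase

# The terms produced by analysis for the example document.
indexed_terms = ["quick", "brown", "fox", "jumped", "over", "lazy", "dog"]

def expand_wildcard(terms, pattern):
    """Return the indexed terms that match the pattern (case-sensitive)."""
    return [t for t in terms if fnmatchcase(t, pattern)]

matching = expand_wildcard(indexed_terms, "*o*")
print(matching)
```

Each term in that list would then get its own 'term' query. Note that the pattern is matched against the terms in the index, not against the original text.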
So it doesn't make sense to analyze a 'term'. Terms are the result of
analysis. If you need to analyze a search "phrase" then you should use a
"query_string" or "field" query.
For the same reason, you can't sort on an analyzed field because the
original data doesn't exist. It is tokenised and stored as terms.
(Unless the field is also stored? I'm not sure.)
The analyzer used to analyze a search phrase is selected in this order:
- "analyzer" specified in the query DSL, eg:
  { "query_string": { "query": "foo bar", "analyzer": "keyword"}}
- "search_analyzer" specified in the mapping
- "analyzer" specified in the mapping
- the default_search analyzer specified in the index configuration
- the default analyzer specified in the index configuration
- the default_search analyzer specified in the node configuration
- the default analyzer specified in the node configuration
- the "standard" analyzer
(I think that's right - I may have added a couple in there that don't
actually exist)
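As a sketch, the lookup order above amounts to "first one that is set wins". The function and the dictionary keys below are hypothetical names for illustration only, not real ElasticSearch internals, and they inherit the same caveat that some of these levels may not actually exist.

```python
def resolve_search_analyzer(query=None, mapping=None,
                            index_settings=None, node_settings=None):
    """Return the name of the first analyzer found, in the order listed above."""
    query = query or {}
    mapping = mapping or {}
    index_settings = index_settings or {}
    node_settings = node_settings or {}
    candidates = [
        query.get("analyzer"),                  # set in the query DSL
        mapping.get("search_analyzer"),         # set in the mapping
        mapping.get("analyzer"),                # set in the mapping
        index_settings.get("default_search"),   # index configuration
        index_settings.get("default"),          # index configuration
        node_settings.get("default_search"),    # node configuration
        node_settings.get("default"),           # node configuration
    ]
    for name in candidates:
        if name is not None:
            return name
    return "standard"  # final fallback

print(resolve_search_analyzer(mapping={"analyzer": "snowball"}))
```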
Typically, it doesn't make sense to use a different analyzer at index
and search time, because you may end up searching for terms that don't
actually exist.
If a field is set to be 'not_analyzed', then the whole value is treated
as a term, so "ABC" and "abc" are different, and "abc" will not match
"abc def".
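In the same toy style, a 'not_analyzed' field can be pictured as one that indexes the whole value as a single term. The helper names below are made up for this sketch.

```python
def not_analyzed_terms(value):
    # The whole value becomes one term: no lowercasing, no splitting.
    return [value]

def term_matches(stored_value, term):
    # A term query matches only if the term equals an indexed term exactly.
    return term in not_analyzed_terms(stored_value)

print(term_matches("abc", "abc"))      # same value, same case
print(term_matches("ABC", "abc"))      # differs only in case
print(term_matches("abc def", "abc"))  # "abc" is only part of the value
```

Only the first call matches; the other two fail for the reasons given above.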
hope this helps
Clint