Note, the "context of the index" you mention is hard to deduce - in ES,
the SearchContext does not hold information about index term statistics
or term lists, it operates on shard level (Lucene index level) and does
not accumulate knowledge about the higher ES index level.
On each node a SearchService is active, which holds the
SearchContexts. When a new query arrives, it is passed to the shards in
string form (the "source" or the "extraSource").
The query source is parsed by the SearchService when the SearchContext
is created, and the parsing depends on the field mapping, which is
fetched from the cluster state.
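For illustration, something like this with the Java client (an untested
sketch; host, index, and field names are just placeholders) - the query
travels as a plain JSON source string and only gets turned into Lucene
objects on each shard:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class SourceStringSearch {
        public static void main(String[] args) {
            TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
            try {
                // the "source" that is shipped to the shards in string form;
                // each shard parses it against its mapping when the SearchContext is created
                String source = "{\"query\":{\"match\":{\"body\":\"quick brown fox\"}}}";
                SearchResponse response = client.prepareSearch("myindex")
                    .setSource(source)
                    .execute().actionGet();
                System.out.println("hits: " + response.getHits().getTotalHits());
            } finally {
                client.close();
            }
        }
    }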
If you use the scan/scroll API, you make the SearchService reuse the
SearchContexts of the participating shards, so the query does not get
parsed again.
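A rough sketch of the scan/scroll pattern (again untested, index and
field names are placeholders) - the scroll id keeps the shard
SearchContexts alive between requests, so the parse step happens only
once:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;

    public class ScanScrollExample {
        public static void scanAll(Client client) {
            SearchResponse response = client.prepareSearch("myindex")
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000)) // keep the shard SearchContexts open for 60s
                .setQuery(QueryBuilders.matchQuery("body", "quick brown fox"))
                .setSize(100) // hits per shard and scroll round
                .execute().actionGet();
            while (true) {
                // each scroll request reuses the shard SearchContexts, the query is not parsed again
                response = client.prepareSearchScroll(response.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
                if (response.getHits().getHits().length == 0) {
                    break; // scroll exhausted
                }
                for (SearchHit hit : response.getHits()) {
                    // process hit ...
                }
            }
        }
    }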
If you want to reuse a parsed query across searches (and across
searching clients), you face some challenges:
- what is the lifespan of a valid parsed query?
- what happens when the mapping changes? Should a parsed query respect
the change?
- what is the most convenient policy for caching parsed queries? The most
frequently used queries? And how do you coordinate the nodes to count
query usage?
- what about enhancing query statements with binding variables?
The housekeeping of such a query cache comes with a certain overhead in
a distributed system. Except for a few edge cases (for example, if you
have only one query that never changes, which is very unusual), I expect
it is faster to simply reparse the query. With the query DSL, translating
a source string into a Lucene Query object is straightforward.
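Just as an analogy for how lightweight the string-to-Query step is, here
is the Lucene-level equivalent with the classic QueryParser (ES uses its
own JSON query DSL parsers, not this class, so take it only as an
illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class ParseDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser(Version.LUCENE_43, "body",
                new StandardAnalyzer(Version.LUCENE_43));
            // a plain string is turned into a Lucene Query object in a single cheap call
            Query query = parser.parse("quick brown fox");
            System.out.println(query); // body:quick body:brown body:fox
        }
    }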
Your use case of filtering out shingles that are not present in an index,
to reduce the number of boolean OR terms in a query, might be solved by
a variant of a spellcheck algorithm, which is nothing but checking
whether a term exists in the index. The suggestion result could be a
preprocessed filter term list. This should be very fast in Lucene 4
because of the use of an FSA in the Lucene term dictionary. Or, even
better, you could replace such a check by periodically pulling the term
list of all words in the dictionary of an index, maybe once a day or so
(depending on the frequency of updates and the arrival of new terms),
and processing it off-line on the client side at query construction time.
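The term-existence check itself would look roughly like this at the
Lucene level (in ES it would have to run where the shard data lives,
e.g. in a plugin; the path and field names are placeholders):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class ShingleExists {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(new File("/path/to/shard/index")));
            try {
                Terms terms = MultiFields.getTerms(reader, "body.shingles");
                if (terms != null) {
                    TermsEnum termsEnum = terms.iterator(null);
                    // seekExact walks the FSA/FST-backed term dictionary, so the lookup is cheap
                    boolean exists = termsEnum.seekExact(new BytesRef("quick brown"));
                    System.out.println("shingle present: " + exists);
                }
            } finally {
                reader.close();
            }
        }
    }

The same TermsEnum could also be used to dump the complete term list for
the off-line variant.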
Just my 2c.
Jörg
On 30.04.13 23:02, Otis Gospodnetic wrote:
Hi,
When a query hits an ES node, where does it get parsed?
Does it get parsed on the node that received the query OR on each
individual node (or shard!) that executes the query?
And is there any way for clients to "pre-parse" the query and avoid
query parsing/rewriting at node/shard level, where I suspect the
parsing is currently done?
For example, the use case I have in mind is a system that does some
app-specific query building, including shingling. This app has
knowledge of the context of the index and could then say "Ah, I
know this shingle never appears in the index, so remove it from the
query because it will just waste cycles".
Thanks,
Otis