Sending pre-parsed query possible?

Hi,

When a query hits an ES node, where does it get parsed?
Does it get parsed on the node that received the query OR on each
individual node (or shard!) that executes the query?

And is there any way for clients to "pre-parse" the query and avoid query
parsing/rewriting at node/shard level, where I suspect the parsing is
currently done?

For example, the use case I have in mind is a system that does some
app-specific query building, including shingling. This app has the
knowledge about the context of the index and could then say "Ah, I know
this shingle never appears in the index, so remove it from the query
because it will just waste cycles".

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Note, the "context of the index" you mention is hard to deduce - in ES,
the SearchContext does not hold information about index term statistics
or term lists; it operates at the shard level (the Lucene index level)
and does not accumulate knowledge about the higher ES index level.

On each node a SearchService is active, which holds the SearchContexts.
When a new query arrives, it is passed to the shards in string form
(the "source" or the "extraSource").
The query source is parsed by the SearchService at SearchContext
creation time, and this depends on the field mapping, which is fetched
from the cluster state.

If you use the scan/scroll API, the SearchService reuses the
SearchContexts of the participating shards, and the query is not parsed
again.
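As a sketch of that reuse (request shapes only; hypothetical index name,
scan/scroll API as of the 0.90-era - check the parameters against your
version):

```
# First request opens the scroll; the query is parsed once per shard
POST /myindex/_search?search_type=scan&scroll=5m
{"query": {"match_all": {}}, "size": 50}

# Follow-up requests send back the scroll_id from the previous response;
# the shards reuse their SearchContexts, so the query is not reparsed
POST /_search/scroll?scroll=5m
<scroll_id from previous response>
```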

If you want to reuse a parsed query across searches (and across
searching clients), you have some challenges:

  • what is the lifespan of a valid parsed query?
  • what happens when the mapping changes, should a parsed query respect
    the change?
  • what is the most convenient policy for caching parsed queries? The top
    used queries? How to coordinate the nodes to count the parsed queries then?
  • how about enhancing query statements with binding variables?
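For illustration, the bookkeeping those questions imply could be sketched
like this (hypothetical names throughout - a toy cache keyed by query
source and mapping version, not anything ES provides):

```python
# Toy sketch: cache parsed queries, invalidating entries when the mapping
# version changes. "parse" here is a placeholder for real query parsing.

class ParsedQueryCache:
    def __init__(self):
        self._cache = {}        # (source, mapping_version) -> parsed form
        self.parse_calls = 0    # counts how often we actually parsed

    def parse(self, source, mapping_version):
        key = (source, mapping_version)
        if key not in self._cache:
            self.parse_calls += 1
            self._cache[key] = {"parsed": source}  # stand-in for a Query object
        return self._cache[key]

cache = ParsedQueryCache()
cache.parse('{"term": {"user": "otis"}}', mapping_version=1)
cache.parse('{"term": {"user": "otis"}}', mapping_version=1)  # cache hit
cache.parse('{"term": {"user": "otis"}}', mapping_version=2)  # mapping changed
print(cache.parse_calls)  # 2
```

Even this toy version shows the lifespan problem: every mapping change
adds new entries, so old entries must be evicted by some policy.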

This housekeeping of a distributed query cache comes with a certain
overhead in a distributed system. Except for a few edge cases (for
example, if you only have one query that never changes - very unusual),
I expect it is faster to simply reparse the query. With the Query DSL,
it is straightforward to translate a source string into a Lucene Query
object.

Your use case of filtering out shingles not present in an index, to
reduce the number of boolean OR terms in a query, might be solved by a
variant of a spellcheck algorithm, which boils down to checking whether
a term exists in the index. The suggestion result could be a
preprocessed filter term list. This should be very fast in Lucene 4
because of the FSA used in the Lucene term dictionary. Or, even better,
you can replace such per-query checks by periodically pulling the term
list of all words in the dictionary of an index and processing it
offline on the client side at query construction time, maybe once a day
or so (depending on the frequency of updates and the arrival of new
terms).
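A minimal sketch of that client-side variant (hypothetical names;
`known_terms` stands for the term list pulled periodically from the
index dictionary):

```python
# Prune query shingles against a term list pulled from the index, so
# shingles known to be absent never make it into the boolean query.

def prune_shingles(shingles, known_terms):
    """Keep only the shingles that actually occur in the index."""
    return [s for s in shingles if s in known_terms]

# Stand-in for the term list refreshed e.g. once a day from the index.
known_terms = {"new york", "york city", "quick brown"}

query_shingles = ["new york", "york city", "city skyline"]
print(prune_shingles(query_shingles, known_terms))  # ['new york', 'york city']
```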

Just my 2c.

Jörg

On 30.04.13 23:02, Otis Gospodnetic wrote:


Hi Jörg, and thank you for a thorough answer.

This part answers my question about where parsing is happening, which is
what I thought:

On each node a SearchService is active, which holds the SearchContexts.
When a new query arrives, it is passed to the shards in string form
(the "source" or the "extraSource").
The query source is parsed by the SearchService at SearchContext
creation time, and this depends on the field mapping, which is fetched
from the cluster state.

I'm wondering if/how one could push the parsing up, so that the same
work is not done redundantly on each shard. Imagine you have a cluster
with 100 shards. A query comes in and all 100 shards are searched. That
means the same "query = parse(query string)" work will be done on all
100 shards. Doing it just once on the node that received the query
might be cheaper, though I assume that means the "query" object would
have to be serialized for sending over the network and then
deserialized?

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Thursday, May 2, 2013 5:13:25 AM UTC-4, Jörg Prante wrote:


Of course that is correct: one parse is faster than 100 parses.

Right now ES pushes the source string over the wire using ES-specific
serialization (via the org.elasticsearch.common.xcontent.ToXContent
interface).
Once parsed into Lucene objects, these would have to be treated as
wrapped objects, since Lucene does not offer serialization of
org.apache.lucene.search.Query.
If all fields and methods were transported over the wire, the serialized
query objects can be assumed to take more space than the source string,
and the wrap/unwrap procedure would add time through specially crafted
code (nothing but boilerplate at the node level).
This competes with very fast recursive-descent parsing into Lucene
Query objects entirely in memory, where the Query DSL is only a thin
wrapper.
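A toy illustration of the size argument (with pickle standing in for a
hypothetical generic wire format for parsed objects, not anything ES
uses):

```python
# Compare the size of a query as its JSON source string with the same
# query parsed into an object graph and serialized generically.
import json
import pickle

class TermQuery:
    """Stand-in for a parsed Lucene-style query object."""
    def __init__(self, field, value):
        self.field = field
        self.value = value

source = json.dumps({"term": {"user": "otis"}})
parsed = TermQuery("user", "otis")

# The serialized object graph carries class metadata and field names,
# so it comes out larger than the compact source string.
print(len(source.encode("utf-8")), len(pickle.dumps(parsed)))
```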
Also note that query execution in ES is divided into several phases, of
which parsing is just a tiny step. The best optimization is to involve
only the shards that are relevant to the query. Some query phases may
outweigh query parse time by far, for example the faceting phase. There
are also several query execution paths (dfs query then fetch, query
then fetch, dfs query and fetch, scan action, count action), which all
differ in execution overhead and shard involvement. As said,
scan/scroll already uses a search context id, preventing the query from
being parsed again.

Although it is true that one parse is faster than 100 parses, I doubt
the time saved can outweigh the existing optimizations along the query
execution path.

To verify this, a first step could be to add some metrics to the ES
code to find out how much time is spent in query parsing (and in the
other parts of the query execution path).

Personally, I think the opposite strategy is more promising: an extra
parsing pass for cost-based execution time estimation, rewriting
queries for shorter execution time, or even rejecting queries that are
too expensive.
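As a sketch of what such a cost-based gate could look like (a naive
clause-count heuristic over a query-DSL-like dict; the names and the
threshold are made up):

```python
# Estimate a query's cost by counting its leaf clauses during an extra
# parsing pass, and reject queries that exceed a configured budget.

def count_clauses(query):
    """Recursively count leaf clauses in a query-DSL-like structure."""
    if isinstance(query, dict):
        return sum(count_clauses(v) for v in query.values())
    if isinstance(query, list):
        return sum(count_clauses(v) for v in query)
    return 1  # a scalar value is one leaf clause

def admit(query, max_clauses=1024):
    return count_clauses(query) <= max_clauses

cheap = {"term": {"user": "otis"}}
expensive = {"bool": {"should": [{"term": {"tag": str(i)}} for i in range(2000)]}}

print(admit(cheap), admit(expensive))  # True False
```

A real estimator would also weigh clause types (wildcards, ranges) very
differently, but the shape of the check is the same.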

Jörg

On 04.05.13 02:37, Otis Gospodnetic wrote:

