Preventing slow queries on public APIs

Howdy,

I'm working on exposing the ES Query DSL on a public API and am looking at
filtering certain parts of the DSL to prevent some of the more obvious
performance problems that could be created. I'm trying to find a good
balance between allowing the full flexibility and power that the DSL
provides while preventing users from inadvertently (or intentionally)
interfering with each other's queries. So far, the things I'm
thinking of filtering out are:

parameters disallowed anywhere:
_cache (I think I'd prefer that we retain control over the caching)
script (from what I can tell there is nothing stopping someone from
creating an infinite loop, or other very costly script)

match/multi_match query:
max_expansions < 15 for phrase queries

mlt, flt queries:
max_query_terms < 30

fuzzy query:
max_expansions is set and < 10
prefix_length > 1

Completely Disallow:
custom_score query (requires a script)
query_string query (lots of ways to write queries that fail)
top_children query (our data doesn't have any publicly available
children/parents right now)
wildcard query
indices query
text query (just because it is deprecated)
script filter
has_child filter
has_parent filter

Facets are also not allowed
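
As a rough illustration of the kind of filtering described above, here is a minimal sketch in Python that walks a parsed query body and collects violations. The names (FORBIDDEN_KEYS, find_violations, and so on) are hypothetical, and it only covers part of the list: the disallowed keys plus the fuzzy and mlt/flt limits. It matches on key names alone, so a document field that happens to be called "text" or "script" would be rejected too, and it assumes max_query_terms is only a problem when it is present and too large.

```python
# Keys rejected wherever they appear (taken from the list above).
FORBIDDEN_KEYS = {
    "_cache", "script", "custom_score", "query_string", "top_children",
    "wildcard", "indices", "text", "has_child", "has_parent", "facets",
}
MLT_KEYS = {"mlt", "more_like_this", "flt", "fuzzy_like_this"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _check_fuzzy(body, path):
    """Require max_expansions < 10 and prefix_length > 1 for every field."""
    if not isinstance(body, dict):
        return [f"{path}: expected an object"]
    problems = []
    for field, params in body.items():
        where = f"{path}.{field}"
        if not isinstance(params, dict):
            # Short form such as {"fuzzy": {"user": "ki"}} sets no limits.
            problems.append(f"{where}: max_expansions must be set")
            continue
        max_exp = params.get("max_expansions")
        if not _is_int(max_exp) or max_exp >= 10:
            problems.append(f"{where}: max_expansions must be set and < 10")
        prefix = params.get("prefix_length")
        if not _is_int(prefix) or prefix <= 1:
            problems.append(f"{where}: prefix_length must be > 1")
    return problems


def find_violations(node, path="$"):
    """Recursively collect human-readable violations under node."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            where = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                problems.append(f"{where}: '{key}' is not allowed")
            elif key == "fuzzy":
                problems.extend(_check_fuzzy(value, where))
            elif key in MLT_KEYS and isinstance(value, dict):
                limit = value.get("max_query_terms")
                if _is_int(limit) and limit >= 30:
                    problems.append(f"{where}: max_query_terms must be < 30")
            problems.extend(find_violations(value, where))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            problems.extend(find_violations(item, f"{path}[{i}]"))
    return problems
```

A request would be rejected whenever the returned list is non-empty; for example, find_violations({"query": {"wildcard": {"user": "ki*"}}}) reports the wildcard key at $.query.wildcard.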

Anyone else looked at doing something similar? Any other types of queries
or edge cases that you've seen cause performance problems?

I'll hopefully remove some of these restrictions over time, but I'd rather
start out without worrying too much about it.

Thanks
-Greg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thanks for the write-up of the cases. One additional case that comes to
mind is phrase queries on frequent words.

See also
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
and
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
(old but good)

Jörg

On 17.04.13 20:40, Greg Brown wrote:

Anyone else looked at doing something similar? Any other types of
queries or edge cases that you've seen cause performance problems?


Hi Jörg,

Thanks for those links, they were really helpful.

I am using stopwords for all languages for which they are available, which
should help in those cases, at the cost of queries like "to be or not to
be" failing. I hadn't seen the new cutoff_frequency parameter available
in 0.90, which looks like it would mitigate this problem. I will have to
reevaluate whether to reindex with stopwords at some point. It feels like a
big change that could have other implications (faceting, other query types).

Looking at the match query again, though, I should probably impose a cap on
slop for phrase queries, probably requiring it to be less than 4.

This discussion also reminds me that I once ran into a case where the max
number of boolean clauses in an AND filter was hit (1024). Will probably
add a limit of 50 clauses for:
and filter
or filter
bool filter
bool query

I should also have a max query depth for when a bunch of query DSL
statements are nested. I'll start with 10.
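
The clause and depth limits above could be checked along these lines. This is only a sketch with made-up helper names (clause_count, oversized_clauses, json_depth). It assumes the and/or filters arrive either as a bare list or as an object with a "filters" list, and that bool clauses sit under must, should and must_not. The depth function counts nested JSON objects rather than nested queries, so a cap of 10 on it is stricter than a cap of 10 on query nesting.

```python
MAX_CLAUSES = 50  # limit proposed above
MAX_DEPTH = 10    # starting depth proposed above


def clause_count(kind, body):
    """Count the direct clauses of an and/or filter or a bool query/filter."""
    if kind in ("and", "or"):
        filters = body.get("filters", []) if isinstance(body, dict) else body
        return len(filters) if isinstance(filters, list) else 1
    if kind == "bool" and isinstance(body, dict):
        total = 0
        for occur in ("must", "should", "must_not"):
            if occur in body:
                clauses = body[occur]
                # A single clause may be given as an object instead of a list.
                total += len(clauses) if isinstance(clauses, list) else 1
        return total
    return 0


def oversized_clauses(node, path="$"):
    """Return (path, count) for every and/or/bool node over MAX_CLAUSES."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            where = f"{path}.{key}"
            if key in ("and", "or", "bool"):
                count = clause_count(key, value)
                if count > MAX_CLAUSES:
                    found.append((where, count))
            found.extend(oversized_clauses(value, where))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.extend(oversized_clauses(item, f"{path}[{i}]"))
    return found


def json_depth(node):
    """Depth of nested JSON objects; lists do not add a level."""
    if isinstance(node, dict):
        return 1 + max((json_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return max((json_depth(v) for v in node), default=0)
    return 0
```

A request would then be refused if oversized_clauses(body) is non-empty or json_depth(body) > MAX_DEPTH.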



Just to add, this is also true for a "terms" query. The default clause
limit is 1024 in Lucene, but it can be raised.
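
A public API could guard against this before the query ever reaches Lucene. The sketch below is only an illustration: the function name oversized_terms is made up, the cap of 50 is an assumption borrowed from the bool clause limit suggested earlier in the thread, and it assumes the terms body maps a field name to a list of values.

```python
MAX_TERMS = 50  # assumed cap, well below the 1024 default mentioned above


def oversized_terms(node, path="$"):
    """Return (path, count) for every terms list longer than MAX_TERMS."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            where = f"{path}.{key}"
            if key == "terms" and isinstance(value, dict):
                for field, terms in value.items():
                    # Non-list entries (options rather than values) are skipped.
                    if isinstance(terms, list) and len(terms) > MAX_TERMS:
                        found.append((f"{where}.{field}", len(terms)))
            found.extend(oversized_terms(value, where))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.extend(oversized_terms(item, f"{path}[{i}]"))
    return found
```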

Jörg

On 17.04.13 22:47, Greg Brown wrote:

This discussion also reminds me that I once ran into a case where the
max number of boolean clauses in an AND filter was hit (1024).
