If the table has 10,000 rows, then this query is way way faster than
if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
"query":{
"fuzzy_like_this_field":{
"body":{
"like_text":"REALLY LONG (snip) TEXT HERE",
"max_query_terms":1,
"min_similarity":0.95,
"ignore_tf":true
}
}
}
}
When I run this search, it takes a few seconds, whereas mysql can
return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only
returns 1 result, but the query itself doesn't appear to be any faster
than if I had set the size to unlimited and returned all the results.
I suspect the query is being run in its entirety and only 1 result is
being returned after the query is done processing. This means the
"size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds
a single record that matches the fuzzy search, rather than processing
every record in the index before returning a response? Am I
misunderstanding something more fundamental about this?
I think you are misunderstanding how MySQL works. It calculates the entire
result set prior to limiting the number of results in a response. This is
understood clearly when you add an ORDER BY clause to the SQL query. It is
true with MySQL that if your ORDER BY and WHERE clauses include only fields
which are indexed, the indicies can be used to prevent a table scan
situation.
Your comparison of a fuzzy search using a "REALLY LONG..." text query in es
vs. a query by an exact match on a field that might be indexed in MySQL is
truly an apples to oranges question. You might as well of asked "How come
it takes my Mini Cooper only 15 seconds to do the 1/4 mile, when it takes
your Maserati Quattroporte an hour to drive from LA to San
Diego. Relational databases are very fast if they can avoid table scans and
just use indicies for queries.
I don't know much about how ES handles a fuzzy search, so someone can
correct me if I am way off base here. I suspect es will use its indicies to
perform the search, except it has to calculate the levenshtein distance
between the terms in each document against each term in your search
criteria. Something that gets exponentially (?) more time consuming the
more terms you have in your search query or the more terms in each document.
On Tue, Dec 20, 2011 at 5:12 PM, James Cook jcook@pykl.com wrote:
I think you are misunderstanding how MySQL works. It calculates the entire
result set prior to limiting the number of results in a response. This is
understood clearly when you add an ORDER BY clause to the SQL query. It is
true with MySQL that if your ORDER BY and WHERE clauses include only fields
which are indexed, the indicies can be used to prevent a table scan
situation.
Your comparison of a fuzzy search using a "REALLY LONG..." text query in es
vs. a query by an exact match on a field that might be indexed in MySQL is
truly an apples to oranges question. You might as well of asked "How come it
takes my Mini Cooper only 15 seconds to do the 1/4 mile, when it takes your
Maserati Quattroporte an hour to drive from LA to San Diego. Relational
databases are very fast if they can avoid table scans and just use indicies
for queries.
I don't know much about how ES handles a fuzzy search, so someone can
correct me if I am way off base here. I suspect es will use its indicies to
perform the search, except it has to calculate the levenshtein distance
between the terms in each document against each term in your search
criteria. Something that gets exponentially (?) more time consuming the more
terms you have in your search query or the more terms in each document.
Your comparison of a fuzzy search using a "REALLY LONG..." text query in es
vs. a query by an exact match
Thats exactly the point. @Brian: To do an exact string match like in
MySql just do not analyze the field and I bet you'll get identical if
not better response times.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.