frazer,
Wildcard queries are notorious for being performance hogs; Lucene
doesn't know how to break a word down into any unit smaller than a term. To
satisfy a wildcard query, it has to walk the term dictionary, collect every
term that matches the pattern, and then search on all of those terms. When
many terms match, this causes a tremendous amount of processing overhead.
To get around this, depending on your wildcard, you can index each
prefix of the value as its own term. Let's say you are indexing my first
name in a field called name and want to run a wildcard query where the
wildcard is at the end. You could index this:
Name: N
Name: Ni
Name: Nic
Name: Nich
Name: Nicho
Name: Nichol
Name: Nichola
Name: Nicholas
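Then you would run a plain term query on whatever precedes the wildcard
(Nic for Nic*). Note that this only works when the wildcard is at one of
the ends; for a leading wildcard you would index the suffixes instead. If
you want the wildcard in other positions, you would have to set up
different terms and then perform a term query against those.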
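Here is a minimal sketch of the idea in Python. It is not Lucene code: edge_prefixes and matches_trailing_wildcard are hypothetical helper names, and a plain set stands in for the index's term dictionary. It only shows that a trailing-wildcard query turns into a single exact lookup once every prefix has been indexed.

```python
def edge_prefixes(value, min_len=1):
    """Return every leading prefix of value, shortest first."""
    return [value[:i] for i in range(min_len, len(value) + 1)]


def matches_trailing_wildcard(pattern, terms):
    """Answer a query like 'Nic*' with one exact lookup instead of a term scan.

    `terms` is a stand-in for the indexed terms of the field (an assumption
    made for illustration, not how Lucene stores them).
    """
    if not pattern.endswith("*"):
        raise ValueError("only a trailing wildcard is supported here")
    return pattern[:-1] in terms


# Index time: store each prefix as its own term.
indexed_terms = set(edge_prefixes("Nicholas"))

# Query time: strip the wildcard and look up the remaining text.
print(matches_trailing_wildcard("Nic*", indexed_terms))
print(matches_trailing_wildcard("Nik*", indexed_terms))
```

The min_len parameter is also an assumption: raising it skips the shortest prefixes, which trades support for very short patterns against index size.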
Of course, the tradeoff here is the size of the index: it will
increase tremendously, but you'll gain much better performance for these
types of queries.
- Nick
-----Original Message-----
From: frazer [mailto:frazer.horn@gmail.com]
Sent: Wednesday, May 04, 2011 1:10 PM
To: users
Subject: Wildcard query performance
I have the following query:
{
  "sort": ["_score"],
  "query": {
    "bool": {
      "must": [
        {"term": {"example_id": 1}},
        {
          "bool": {
            "minimum_number_should_match": 1,
            "should": [
              {"query_string": {"query": "12* OR year*"}}
            ]
          }
        }
      ]
    }
  },
  "fields": []
}
It performs quite badly: it takes around 6.5 seconds on an index of
around 2.5 million documents.
If I remove the wildcard:
"query_string": {"query":"12 OR year*"} << 22 ms
If I change the numeric wildcard to one on letters:
"query_string": {"query":"ax* OR year*"} << 98 ms
I understand there would be some performance penalty for wildcard
searches but I wouldn't expect the number wildcard search to perform
so badly.
"query_string": {"query":"12* OR year*"} << 6.5 seconds
Could this be a bug?