Wildcard query performance


(frazer) #1

I have the following query:

{
  "sort": ["_score"],
  "query": {
    "bool": {
      "must": [
        {"term": {"example_id": 1}},
        {
          "bool": {
            "minimum_number_should_match": 1,
            "should": [
              {"query_string": {"query": "12* OR year*"}}
            ]
          }
        }
      ]
    }
  },
  "fields": []
}

It performs quite badly: it takes around 6.5 seconds against an index of
around 2.5 million documents.

If I remove the first wildcard:
"query_string": {"query":"12 OR year*"} << 22 ms

If I change the numeric wildcard to a letter wildcard:
"query_string": {"query":"ax* OR year*"} << 98 ms

I understand there is some performance penalty for wildcard
searches, but I wouldn't expect the numeric wildcard search to perform
so badly:
"query_string": {"query":"12* OR year*"} << 6.5 seconds

Could this be a bug?


(Shay Banon) #2

I can't see how this can be a bug; it can perform badly if there are many terms that match.


(Administrator-2) #3

frazer,

Wildcard queries are notorious for being performance hogs; Lucene
doesn't know how to break a word down into any unit smaller than a term. To
satisfy a wildcard query, it has to go through the terms in the index and
check each one against the pattern. When many terms match, this causes a
tremendous amount of processing overhead.

To get around this, depending on your wildcard, you can index each
leading prefix of the word as its own term. Let's say you are using my first
name in a field called name and want to run a wildcard query where the
wildcard is at the end. You could do this:

Name: N
Name: Ni
Name: Nic
Name: Nich
Name: Nicho
Name: Nichol
Name: Nichola
Name: Nicholas

Then you would search on just the term. Note that this only works
where the wildcard is at one of the ends. If you want it in other
positions, you would have to index a different set of terms and then run a
term query against those.

Of course, the tradeoff here is the size of the index: it will
increase tremendously, but you'll gain much better performance for these
types of queries.

	- Nick
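
The prefix-indexing idea above can be sketched in a few lines of Python. This is only an illustration of the technique, not how Lucene or Elasticsearch store terms; the helper names (edge_prefixes, build_prefix_index) and the sample documents are made up for the example.

```python
def edge_prefixes(term, min_len=1):
    """Return every leading prefix of `term`, shortest first."""
    return [term[:i] for i in range(min_len, len(term) + 1)]


def build_prefix_index(docs):
    """Map each prefix to the set of document ids whose name starts with it."""
    index = {}
    for doc_id, name in docs.items():
        for prefix in edge_prefixes(name):
            index.setdefault(prefix, set()).add(doc_id)
    return index


# Hypothetical sample documents, for illustration only.
docs = {1: "Nicholas", 2: "Nicole", 3: "Frazer"}
index = build_prefix_index(docs)

# A trailing-wildcard search such as "Nic*" becomes an exact lookup of "Nic".
matches = index.get("Nic", set())
print(sorted(matches))
```

The lookup is a single exact-term match no matter how many names share the prefix. The price is visible in the index: three short names already produce 17 distinct terms.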



(Shay Banon) #4

Just to add to that: in many cases stemming or ngram analysis can also "solve" the problem. This can also be set up using the multi field option.
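
As a rough sketch of what the ngram plus multi field setup could look like, here are index settings and a mapping written as Python dictionaries. It assumes a recent Elasticsearch release where sub-fields are declared under "fields" and the edge_ngram token filter is available; the syntax in use when this thread was written may differ. The names prefix_grams, prefix_index and name.prefix are invented for the example.

```python
import json

# Index settings: a custom analyzer that emits leading-edge n-grams at index time.
settings = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; gram sizes are arbitrary example values.
            "prefix_grams": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20}
        },
        "analyzer": {
            "prefix_index": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "prefix_grams"],
            }
        },
    }
}

# Mapping: "name" is indexed normally, and again as "name.prefix" with n-grams.
mappings = {
    "properties": {
        "name": {
            "type": "text",
            "fields": {
                "prefix": {
                    "type": "text",
                    "analyzer": "prefix_index",
                    # Search terms are not n-grammed, so "nic" matches the stored gram "nic".
                    "search_analyzer": "standard",
                }
            },
        }
    }
}

index_body = {"settings": settings, "mappings": mappings}
print(json.dumps(index_body, indent=2))
```

With a setup along these lines, a search for nic* could be rewritten as a plain match on name.prefix, so no wildcard expansion is needed at query time.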


(frazer) #5

Thanks for the responses. I was trying to point out that there is a
big difference in performance between these two wildcard queries:

  1. "query_string": {"query":"ax* OR year*"} << 98 ms
  2. "query_string": {"query":"12* OR year*"} << 6 seconds

I'm surprised by the difference, as I wouldn't expect the
second query to perform much differently from the first. Hence I wondered
if it was a bug, but maybe I'm missing something.

Thanks for your time

Frazer



(Shay Banon) #6

There are probably many more terms matching 12* than ax* (in the _all field; note that numeric fields are added to it as well).
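
A toy model shows why the number of matching terms matters. This is not Lucene's actual implementation, just a sorted list standing in for a term dictionary; the term values are invented, with the numeric ones imitating what numeric fields might contribute to _all.

```python
from bisect import bisect_left


def matching_terms(sorted_terms, prefix):
    """Return the terms in a sorted term list that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    # Terms sharing a prefix are contiguous in sorted order, so scan until one fails.
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


# Hypothetical term dictionary: a few words plus the numbers 0..19999 as strings.
text_terms = ["axe", "axis", "axle", "year", "yearly", "years"]
numeric_terms = [str(n) for n in range(20000)]
terms = sorted(text_terms + numeric_terms)

print(len(matching_terms(terms, "ax")))
print(len(matching_terms(terms, "12")))
```

In this model the ax prefix expands to a handful of terms while the 12 prefix expands to over a thousand, and each expanded term has to be searched. A field that collects many distinct numbers would show the same imbalance.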


(frazer) #7

(in the _all field, note, numeric fields are added to it as well) << I
wouldn't mind betting that's it; we have quite a few numeric fields.

Thanks for your help
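
If the numeric fields in _all are indeed the cause, one option is to point query_string at specific text fields through its "fields" parameter instead of letting it default to _all. A minimal sketch that builds such a request body in Python; the field names title and description are hypothetical stand-ins for whatever text fields the index really has.

```python
import json


def build_query(example_id, text, fields):
    """Build a search body whose query_string only targets the given fields."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"example_id": example_id}},
                    {"query_string": {"query": text, "fields": fields}},
                ]
            }
        }
    }


# "title" and "description" are assumed field names, not taken from the thread.
body = build_query(1, "12* OR year*", ["title", "description"])
print(json.dumps(body, indent=2))
```

The wildcard then only expands against the terms of those fields, so values from unrelated numeric fields are not considered.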



(Otis Gospodnetić) #8

It's worth pointing out that in Lucene 4.* wildcard queries are 100x
faster:

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


