In my continuing quest to make my search fast enough I've run into another
roadblock: phrase queries. On most user queries I generate a regular
boolean query for their terms but I also generate a rescore that checks if
their query matches as a phrase query with slop 1. That means that every
query is also a phrase query. I've found that varying the size of the
rescore window affects performance considerably:
1024 will push one or two of my servers over the edge and they'll start io
thrashing.
256 is actually OK if the caches are hot but if they aren't it can push
me into io thrashing.
64 seems perfectly ok. Comfortable even.
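For reference, the setup I'm describing looks roughly like this in the search request (the index name, field name, and query text here are just placeholders, not my real ones):

```json
POST /enwiki/_search
{
  "query": {
    "match": { "text": "quick brown fox" }
  },
  "rescore": {
    "window_size": 64,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "text": { "query": "quick brown fox", "slop": 1 }
        }
      }
    }
  }
}
```

The `window_size` is the knob I'm talking about above (1024 / 256 / 64): it's the number of top hits per shard that get re-scored by the sloppy phrase query.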
Obviously if I throw more hardware at the problem it'll get better - more
replicas and shards and better disks will help. So will more ram. Ram
makes everything better.....
Anyway - say my hardware cycle takes a few months and I need a fix faster -
is there something I can do? I'm reasonably sure I can do something with
a shingle filter but I'm not sure exactly what that something is in the
case of queries with a slop. Has anyone dealt with a case like this before?
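A sketch of what I imagine the shingle approach looks like (all names hypothetical): index the field a second time through a shingle filter, then rescore with a plain match query on the shingled sub-field, so the proximity check becomes ordinary term lookups instead of positions traversal. The catch is that bigram shingles only reward exactly adjacent pairs, so this approximates slop 0, not slop 1 - which is exactly the part I'm unsure about.

```json
PUT /enwiki
{
  "settings": {
    "analysis": {
      "filter": {
        "bigrams": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "shingled": {
          "tokenizer": "standard",
          "filter": ["lowercase", "bigrams"]
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "shingles": { "type": "string", "analyzer": "shingled" }
          }
        }
      }
    }
  }
}
```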
One thing on my side is that I don't really need phrase queries. I can
play around with the specification a bit so long as I stay sane. I just
need to make documents that contain the terms near each other float to the
top. It'd be better if it matched the exact phrase but some false
positives are probably ok. The phrase query got the job done but if there
is a way to cheat it I'm happy to try.
On Mon, Sep 8, 2014 at 4:24 PM, Nikolas Everett nik9000@gmail.com wrote:
One thing on my side is that I don't really need phrase queries. I can
play around with the specification a bit so long as I stay sane. I just
need to make documents that contain the terms near each other float to the
top. It'd be better if it was the exact phrases but some false positives is
probably ok. The phrase query got the job done but if there is a way to
cheat it I'm happy to try.
For this purpose, why not stay with small window sizes (e.g. your 64,
or maybe even much smaller)? IMO terms being present within massively
large windows means nothing. Personally I would consider one much
smaller, like 5. I know there have been experiments/papers around
this; I can dig them up if you need, but I think it's also kind of
intuitive.
This is probably a lot easier than doing anything around speeding up
sloppy phrase scoring.
Sorry, I mean the rescore window. I just set the phrase slop to 1.
0 ignores some good matches and 2 brings up too much. That's really me
tuning it to my tastes more than anything, but yeah. I could try setting a
higher slop and see if that improves precision and what it costs in
performance though.
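Concretely that experiment would just mean bumping the slop in the rescore query, something like this (field and query text are placeholders again):

```json
"rescore": {
  "window_size": 64,
  "query": {
    "rescore_query": {
      "match_phrase": {
        "text": { "query": "quick brown fox", "slop": 2 }
      }
    }
  }
}
```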