In my continuing quest to make my search fast enough I've run into another
roadblock: phrase queries. On most user queries I generate a regular
boolean query for their terms but I also generate a rescore that checks if
their query matches as a phrase query with slop 1. That means that every
query is also a phrase query. I've found that varying the size of the
rescore window affects performance considerably:
1024 will push one or two of my servers over the edge and they'll start io
thrashing.
256 is actually OK if the caches are hot but if they aren't it can push
me into io thrashing.
64 seems perfectly ok. Comfortable even.
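For reference, the setup I'm describing looks roughly like this in the search request (the index name, field name, and query text here are just placeholders, not my real ones):

```json
POST /enwiki/_search
{
  "query": {
    "match": { "text": "quick brown fox" }
  },
  "rescore": {
    "window_size": 64,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "text": { "query": "quick brown fox", "slop": 1 }
        }
      }
    }
  }
}
```

The `window_size` is the knob I'm talking about above (1024 / 256 / 64): it's the number of top hits per shard that get re-scored by the sloppy phrase query.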
Obviously if I throw more hardware at the problem it'll get better - more
replicas and shards and better disks will help. So will more ram. Ram
makes everything better.....
Anyway - say my hardware cycle takes a few months and I need a fix faster -
is there something I can do? I'm reasonably sure I can do something with
a shingle filter but I'm not sure exactly what that something is in the
case of queries with a slop. Has anyone dealt with a case like this before?
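A sketch of what I imagine the shingle approach looks like (all names hypothetical): index the field a second time through a shingle filter, then rescore with a plain match query on the shingled sub-field, so the proximity check becomes ordinary term lookups instead of positions traversal. The catch is that bigram shingles only reward exactly adjacent pairs, so this approximates slop 0, not slop 1 - which is exactly the part I'm unsure about.

```json
PUT /enwiki
{
  "settings": {
    "analysis": {
      "filter": {
        "bigrams": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "shingled": {
          "tokenizer": "standard",
          "filter": ["lowercase", "bigrams"]
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "shingles": { "type": "string", "analyzer": "shingled" }
          }
        }
      }
    }
  }
}
```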
One thing on my side is that I don't really need phrase queries. I can
play around with the specification a bit so long as I stay sane. I just
need to make documents that contain the terms near each other float to the
top. It'd be better if it matched the exact phrase but some false
positives are probably ok. The phrase query got the job done but if there
is a way to cheat it I'm happy to try.
On Mon, Sep 8, 2014 at 4:24 PM, Nikolas Everett nik9000@gmail.com wrote:
One thing on my side is that I don't really need phrase queries. I can
play around with the specification a bit so long as I stay sane. I just
need to make documents that contain the terms near each other float to the
top. It'd be better if it was the exact phrases but some false positives is
probably ok. The phrase query got the job done but if there is a way to
cheat it I'm happy to try.
For this purpose, why not stay with small window sizes (e.g. your 64,
or maybe even much smaller)? IMO terms being present within massively
large windows means nothing. Personally I would consider one much
smaller, like 5. I know there have been experiments/papers around
this; I can dig them up if you need, but I think it's also kind of
intuitive.
This is probably a lot easier than doing anything around speeding up
sloppy phrase scoring.
Sorry, I mean the rescore window. I just set the phrase slop to 1.
0 ignores some good matches and 2 brings up too much. That's really me
tuning it to my tastes more than anything, but yeah. I could try setting a
higher slop and see if that improves precision and what it costs in
performance though.
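Concretely that experiment would just mean bumping the slop in the rescore query, something like this (field and query text are placeholders again):

```json
"rescore": {
  "window_size": 64,
  "query": {
    "rescore_query": {
      "match_phrase": {
        "text": { "query": "quick brown fox", "slop": 2 }
      }
    }
  }
}
```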