Slow performance on phrase queries in should clause

Our system is normally very responsive, but very occasionally people submit
long phrase queries which time out and cause high system load. Not all long
phrase queries cause issues, but I have been debugging one that I've
found.[1]

The query is in the filter section of a constant score query, as below. This
form times out. However, if I move the query out of the should section and
into the must section, the query runs very quickly (in the full query,
there was another filter in the should section). Converting this to an AND
filter is also fast. Is there a reason for this? Are should filters
executed on the full set and not short-circuited with the results of must
filters?

{
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": { "terms": { -- selective terms filter.... -- } },
                    "should": { "query": { "match": { "text": { "query": "…", "type": "phrase" } } } }
                }
            }
        }
    }
}
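
For comparison, here is a rough sketch of the three forms described above (should, must, and the AND filter) built as request bodies in Python. The field name `feed_id` and its values are hypothetical stand-ins for the elided terms filter, and the phrase text is left elided as in the post. The `and` filter shown is the 1.x array form.

```python
import json

# Hypothetical stand-in for the elided "selective terms filter" (assumed field and values).
TERMS_FILTER = {"terms": {"feed_id": ["feed-1", "feed-2"]}}
# The phrase text is elided in the post and stays elided here.
PHRASE_FILTER = {"query": {"match": {"text": {"query": "…", "type": "phrase"}}}}


def constant_score(filter_body):
    """Wrap a filter in the constant_score query used in the post."""
    return {"query": {"constant_score": {"filter": filter_body}}}


# Form from the post that times out: phrase query as a should clause.
slow_should = constant_score({"bool": {"must": TERMS_FILTER, "should": PHRASE_FILTER}})

# Reported fast: phrase query moved into must, next to the terms filter.
fast_must = constant_score({"bool": {"must": [TERMS_FILTER, PHRASE_FILTER]}})

# Reported fast: the same two filters combined with an `and` filter.
fast_and = constant_score({"and": [TERMS_FILTER, PHRASE_FILTER]})

if __name__ == "__main__":
    for name, body in [("should", slow_should), ("must", fast_must), ("and", fast_and)]:
        print(name, json.dumps(body, ensure_ascii=False))
```

Printing the bodies side by side makes it easier to confirm that only the placement of the phrase filter changes between the slow and fast forms.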

[1] query
-- ぶ新サービスは2015年春にリリースの予定。IoTのハードウェアそのものではなく、SDKやデータベース、解析、IDといったバックグラウンド環境をサービスとして提供するというものだ。発表後、松本氏は「例えばイケてる時計型のプロダクトを作ったとして、(機能面では)単体での価値は1〜2割だったりする。でも本当に重要なのはバックエンド。しかしユーザーから見てみれば時計というプロダクトそのものに大きな価値を感じることが多い。そうであれば、IoTのバックエンドをBaaS(Backend as a Service:ユーザーの登録や管理、データ保管といったバックエンド環境をサービスとして提供すること)のように提供できればプロダクトの開発に集中できると思う。クラウドが出てネットサービスの開発が手軽になったのと同じような環境を提供したい」とサービスについて語ってくれた。

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5b1e6260-5c19-4ac7-bf1e-939360bf509e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It's likely the should clause is (stupidly) being fully expanded before
being AND'd with the must clause. There are improvements to this
(XBooleanFilter.java) in master, though. Are you able to test and see if
it's still slow?

Mike McCandless

http://blog.mikemccandless.com

In reply to the original message from Kireet Reddy (kireet@feedly.com), sent 2014-12-04 19:21 GMT-05:00.

I spent some more time debugging this yesterday, and it started driving me
a little crazy. To test my theory, I reduced the number of terms in my must
filter from ~100 to 1. If the should clause were executing over all
documents, the query should have remained slow. But it ended up executing
quickly! So I am a little lost as to what's going on. Does
elasticsearch/lucene use any heuristics about which clause to execute first
that might cause this? I am using 1.3.5.

I'll ask our ops guys whether we can set up an installation of the master
branch and see if there's any improvement. Would I need to change the query
at all? In the meantime, is there anything I can do on the 1.3 branch?
Should I split the should clauses off into a separate bool filter and wrap
both in an and? I.e.
AND of

  • bool filter with the selective terms filter (must)
  • bool filter with the should filters
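
The restructuring asked about above (should clauses split into their own bool filter, with both bool filters wrapped in an and) might look roughly like this when built in Python. This is only a sketch of the question being asked, not a confirmed fix for 1.3; the helper name `split_should` and the `feed_id` field are hypothetical.

```python
import json


def split_should(must_filters, should_filters):
    """Build an `and` filter over two bool filters: one holding only the
    must clauses, the other holding only the should clauses."""
    return {
        "and": [
            {"bool": {"must": must_filters}},
            {"bool": {"should": should_filters}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical terms filter; the phrase text stays elided as in the thread.
    must = [{"terms": {"feed_id": ["feed-1", "feed-2"]}}]
    should = [{"query": {"match": {"text": {"query": "…", "type": "phrase"}}}}]
    body = {"query": {"constant_score": {"filter": split_should(must, should)}}}
    print(json.dumps(body, ensure_ascii=False, indent=2))
```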

Also, I've run into a few of these performance issues. It would have been
really helpful to have something like the explain plan for database
queries, or an explain-type option on the query that collects performance
info at each step of processing and sends it back with the results. Right
now it's really kind of a black box for me, especially with caching kicking
in at times. Has there ever been any thought about implementing something
like this in lucene/elasticsearch?
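
Short of a real explain plan, one crude substitute is to run each query variant a few times and compare the `took` value (milliseconds) that a search response reports. The sketch below assumes you supply a `search` function that sends a request body to your cluster and returns the parsed response; the name `compare_took` and the fake client in the demo are hypothetical. Per-run values are kept rather than averaged, since filter caching can make later runs much faster than the first.

```python
from typing import Callable, Dict, List


def compare_took(search: Callable[[dict], dict],
                 variants: Dict[str, dict],
                 runs: int = 3) -> Dict[str, List[int]]:
    """Run each named request body `runs` times and collect the `took`
    value (ms) from every response, in run order."""
    timings: Dict[str, List[int]] = {}
    for name, body in variants.items():
        timings[name] = [search(body)["took"] for _ in range(runs)]
    return timings


if __name__ == "__main__":
    # Fake client for demonstration only: pretends every search took 5 ms.
    def fake_search(body: dict) -> dict:
        return {"took": 5, "hits": {"total": 0, "hits": []}}

    print(compare_took(fake_search, {"should": {}, "must": {}}, runs=2))
```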

Thanks
Kireet

In reply to Michael McCandless's message of Friday, December 5, 2014 3:12:49 AM UTC-8.

Just a wild guess here, but do the slow phrase queries contain duplicates?
For example,

"the the the the the"

Again, just a guess based on some past experience with another engine.
Duplicate words in a phrase query would cause a significant slowdown even
with a tiny database on a locally hosted blazingly fast machine. That was
its only weak point, but it was a significant weak point.
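
A quick way to check this guess is to count repeated tokens in a slow phrase. The sketch below assumes you already have the token list for the phrase (for example, from running it through the field's analyzer); the whitespace split in the demo is only a stand-in and would not tokenize Japanese text correctly. The helper name `repeated_tokens` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def repeated_tokens(tokens: Iterable[str]) -> Dict[str, int]:
    """Return each token that occurs more than once, with its count."""
    counts = Counter(tokens)
    return {token: n for token, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Whitespace tokenization is an assumption that only suits this English example.
    print(repeated_tokens("the the the the the".split()))
```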

Brian
