Need Help: Upgrade of ES + Large queries = new CPU overload


(Scott Decker) #1

Hey all,
We have been testing the new 1.3.1 release under our current load and
queries, and have found that under the same conditions, with the same
queries, the ES cluster starts to max out CPU, the thread pools fill up,
and query times keep climbing until eventually we have to restart nodes
just to clear things.
On our older (0.20.6) version we do have big queries, think 100+ terms, but
they were all wrapped in a filter and cached. We almost never did any
scoring, and when we did, it was only on a few terms.
So a query may look like the following:

"query": {
"filtered": {
"query": {
"constant_score": {
"query": {
"bool": {
"must": [
{
"bool": {
"should": {
"bool": {
"should": [
{
"term": {
"content": "smyrna"
}
},
{
"term": {
"title": "smyrna"
}
}
]
}
}
}
}
]
}
},
"boost": 1
}
},
"filter": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"bool": {
"must": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"terms": {<fill in long lists of ids
here}

The filter is broken up into multiple sections, and each section is given a
cache name and cache key.
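
To be concrete, one of those cached filter sections might look roughly like the sketch below. This is the pre-2.0 filter-cache syntax; the field name, ids, and cache key here are made up for illustration, not taken from our actual queries:

```json
{
  "terms": {
    "document_id": ["id-1", "id-2", "id-3"],
    "_cache": true,
    "_cache_key": "known_ids_set_1"
  }
}
```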

So, what could have changed between 0.20.6 and 1.3.1 that would cause this
sort of non-scored filtered query to suddenly spend so much CPU time?
I did a thread dump and it shows multiple threads in the .scorer state of
FilteredQuery; not sure if that matters.

Any help figuring out where ES is spending its time on all of this would be
appreciated. We at least have Marvel up and running now, and it tells us
that CPU gets pegged and shows the average query times, but I'm not sure
how to start debugging the query side to see what could have changed under
the hood to cause such a drastic change.
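
One way to see where ES is burning CPU on a 1.x cluster is the hot threads API, which samples the busiest threads on each node and prints their stack traces. A sketch (the `threads` and `interval` values here are just example settings):

```
GET /_nodes/hot_threads?threads=3&interval=500ms
```

If the filter/query side is the problem, you would expect to see the query thread pool dominating, with stacks pointing at the scorer or filter-cache code.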

Thanks,
Scott

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/be45cf36-3b4a-4452-b3bc-461a879dec02%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Scott Decker) #2

Well, in case anyone wants to know, it was because we had
_cache: true
and
_cache_key:
set in our filter sets, basically because they are known filters that do
not change.

For some reason, having these set caused huge amounts of CPU usage. I'm not
sure what was happening behind the scenes, but this was our culprit. We
will have to look into the code and see why this causes such an issue.
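
For anyone hitting the same thing, the fix was simply to drop the explicit cache directives and let ES apply its own filter-cache defaults. Roughly (field name and ids illustrative, as before):

```json
{
  "terms": {
    "document_id": ["id-1", "id-2", "id-3"]
  }
}
```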

On Monday, September 1, 2014 7:50:35 AM UTC-7, Scott Decker wrote:

[quoted text trimmed]

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/19fee079-ce3e-4e77-a4b4-7e95c35f9d98%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(system) #3