Every other query slow

Hello

I'm trying to get a new ES cluster tuned properly before putting it into
production, and I'm running into some performance issues.

While testing, I noticed that when running the same query multiple times, I
got alternating fast (~50 ms) and slow (2-3 s) results. It's the exact same
query, submitted via curl, and it happens consistently query after query.

curl -XGET 'localhost:9200/my_index/my_type/_search?routing=1234' -d @fileWithQuery

I literally hit up/enter time after time. At one point I wandered away for
~30 minutes after a slow execution, came back, hit up/enter, and it finished
in 40 ms. The immediate next attempt took 2 seconds.

I'm running ES 1.1.1 on a two-node cluster. There are three indices: three
shards each for the smaller two, and six shards for the larger index, which
is the one I'm hitting. I'm using custom routing for all three. Each index
has one replica, so every server holds a copy of all 12 shards. The index in
question is ~125 GB; the other two are roughly 10 GB and 2 GB.
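
If it helps with diagnosis, the shard layout can be confirmed with the
standard _cat endpoints; a rough sketch of what to look at:

# List every shard, whether it's a primary or a replica, and which node holds it.
curl 'localhost:9200/_cat/shards?v'

# Per-index totals, to confirm the six-shard index is the ~125 GB one.
curl 'localhost:9200/_cat/indices?v'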

Summary server info:
ES 1.1.1
2 AWS r3.xlarge instances (30.5 GB RAM each)
One node with an 18 GB heap and G1 GC
One node with a 20 GB heap and the default (CMS) GC

I know the heap is set higher than the recommended half of physical RAM, but
I wouldn't think that would be the current problem.
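
For completeness, a rough sketch of how the two nodes' JVMs are configured
(the real flags live in bin/elasticsearch.in.sh; the lines below are
illustrative rather than copied verbatim):

# Node 1: 18 GB heap, G1. On 1.x, switching to G1 means replacing the shipped
# ParNew/CMS flags in elasticsearch.in.sh with -XX:+UseG1GC rather than
# appending it, or the JVM rejects the conflicting collector options.
export ES_HEAP_SIZE=18g

# Node 2: 20 GB heap, collector flags left at the shipped defaults (CMS).
export ES_HEAP_SIZE=20g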

My first thought was that I was simply hitting one server and then the
other via round-robin, and that I needed to figure out which server was slow.
However, the stats reported in ElasticHQ indicated that the queries were
hitting the same server each time (there was no other searching going on
and only limited indexing). Even when I ran the search from the other
server, ElasticHQ still indicated that the queries were executing on the one
server (and the same fast/slow/fast pattern appeared, though independent
of the cycle on the other server). I'm not sure why the other server was
never being hit, though ElasticHQ DID report about 3x the search activity
on the server the queries were running on. That might be my next question.
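
One thing I haven't tried yet, but probably should, is pinning the search to
a specific node with the preference parameter, to rule the round-robin theory
in or out; a rough sketch (NODE_ID is a placeholder for an id from /_nodes):

# Run only against shards held by the node receiving the request.
curl -XGET 'localhost:9200/my_index/my_type/_search?routing=1234&preference=_local' -d @fileWithQuery

# Or pin to one specific node by id and compare timings node by node.
curl -XGET 'localhost:9200/my_index/my_type/_search?routing=1234&preference=_only_node:NODE_ID' -d @fileWithQuery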

There are warmers in place for some of the fields. The field data cache
reports no evictions and hovers around 4 GB, though there ARE a lot of
evictions in the filter cache. I think that's probably inevitable given how
much variety can come through in the searches, though I'm open to advice.
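
For reference, those cache numbers come from ElasticHQ, which I believe just
reads the node stats API; the same data is visible directly, and the filter
cache size is configurable if the evictions turn out to matter (the 20% below
is only an example):

# The filter_cache and fielddata sections are the relevant ones.
curl 'localhost:9200/_nodes/stats/indices?pretty'

# The filter cache defaults to 10% of heap; it can be raised in elasticsearch.yml:
#   indices.cache.filter.size: 20%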

I've pasted a sample query below. It's admittedly a bit ugly, because it's
built dynamically from a large number of search criteria with various levels
of nesting. I've tried cleaned-up versions of the same query (removing
unnecessary filters) with the same results, but I've included it as is (with
renamed fields) in case there's something wrong with it.

Note that while I've been testing and writing this post, I found that
removing the nested sort and instead sorting on a non-nested field (see the
snippet after this paragraph) does not produce the fast/slow/fast pattern;
those runs are all fast. However, I've since tested other queries, including
some with no sort/limit at all, and found the same pattern. There is a lot
of nesting, and sometimes has_child filters are involved. Executing somewhat
different (though admittedly similar) queries produces the same pattern
across queries, regardless of which is run when. Fast/slow/fast.
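
For reference, the fast variant just drops the nested website.domain.sortable
clause and sorts on the top-level field alone; roughly:

"sort" : [ {
  "id" : { "order" : "asc" }
} ]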

So, any idea as to what is going on here? The fast queries are completely
adequate; the slow ones are completely inadequate. I need to figure this
out.

Let me know if any other info is needed. Thanks in advance.

{
  "from" : 0,
  "size" : 50,
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : { }
      },
      "filter" : {
        "and" : {
          "filters" : [ {
            "term" : {
              "accountId" : 1234
            }
          }, {
            "nested" : {
              "filter" : {
                "and" : {
                  "filters" : [ {
                    "nested" : {
                      "filter" : {
                        "and" : {
                          "filters" : [ {
                            "or" : {
                              "filters" : [ {
                                "term" : {
                                  "stage1.stage2.bool1" : true
                                }
                              }, {
                                "term" : {
                                  "stage1.stage2.bool2" : false
                                }
                              } ]
                            }
                          } ]
                        }
                      },
                      "path" : "stage1.stage2"
                    }
                  } ]
                }
              },
              "path" : "stage1"
            }
          } ]
        }
      }
    }
  },
  "fields" : "id",
  "sort" : [ {
    "website.domain.sortable" : {
      "order" : "asc",
      "missing" : "0",
      "nested_path" : "website"
    }
  }, {
    "id" : {
      "order" : "asc"
    }
  } ]
}

One more note: the slow executions of the query do not appear to show up in
the slow query log. Also, the times quoted above come from the response's
"took" field.
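
A rough sketch of how those timings are captured, in case the method matters
(the python one-liner is just for illustration):

curl -s -XGET 'localhost:9200/my_index/my_type/_search?routing=1234' -d @fileWithQuery \
  | python -c 'import json,sys; print(json.load(sys.stdin)["took"])'

# If thresholds are the reason nothing shows up, lowering
# index.search.slowlog.threshold.query.warn (e.g. to 1s) in the index settings
# should force the 2-3 s executions into the slow log.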
