First search per connection takes ~30 seconds

I'm using Elastic Cloud (ES 5.4.2) and the Python client. We currently have about ~30M documents in our 1 node cluster across 17 indices and 81 shards. No memory pressure issues. I've been writing a Django front-end for our application. Everything was going great until I took a break and returned to find that the first search for each page load is taking about 30 seconds to complete. Subsequent searches on the same page load perform normally, including searches on different indices.

This is true even if the first search is extremely simple. In fact, it's true even if the first search has a syntax error! I can replicate the behavior in a test script running outside the Django environment:

# create connection
es = create_es_connection(args.credentials, args.environment)

print "start", datetime.now()
es.get(index='index-name', doc_type='type-name', id=1221)
print "1221", datetime.now()
es.get(index='index-name', doc_type='type-name', id=1222)
print "1222", datetime.now()

es = create_es_connection(args.credentials, args.environment)

print "start", datetime.now()
es.get(index='index-name', doc_type='type-name', id=1221)
print "1221", datetime.now()
es.get(index='index-name', doc_type='type-name', id=1222)
print "1222", datetime.now()

In each case the first search takes 25-35 seconds. The second search is nearly instantaneous, as expected. Additional searches, more complex searches, searches of other indices, and waiting several seconds between searches all produce identical results. It's always the first search that is the problem. Again, this is true even if the first search actually produces a syntax error.

What might have caused this behavior and how can I prevent it?

Here is some logging output, showing a simple match query and a more complex more like this query on a different index.

GET https://[cluster address].us-east-1.aws.found.io:9243/index1/_search [status:200 request:33.201s]
> {"query": {"match": {"id": "12"}}}
< {"took":0,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":5.2983174...
GET https://[cluster address].us-east-1.aws.found.io:9243/index2/_search [status:200 request:0.154s]
> {"query": {"bool": {"must": [{"more_like_this": ...
< {"took":64,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":244 ...

As you can see, the actual query is almost instantaneous but the request took 33s. The second request took only 0.154s, which is a lot more reasonable.

Update: Some more diagnosis revealed that it's a network issue of some kind. Connecting to the ES cluster via a different network eliminated the problem. Still not sure what the underlying issue was on the original network, though (firewall, DNS, etc). Any hints on diagnosing that would still be helpful.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.