ES performance reliability issue

My query is very simple:
{ "query": {"bool": { "filter": {"term": {"user_guid": "xxxx" }}}}}
The user_guid is the same as the document ID, and the query includes a routing argument.

If I run this query a few times (with a different user_guid each time), the "took" value in the returned JSON is less than 10ms.

But if I run this query 100,000 times (with a different user_guid each time), about 500-1000 of the queries take more than 50ms. I want to reduce that number to 0, or at least to fewer than 10.

Out of every 1000 high-latency queries (more than 50ms):
only about 100 appear in the slowlog, 40% in the fetch phase and 60% in the query phase.
the other 900 are not in the slowlog, so I have no logs to analyze for them.
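
Since most of the slow requests never reach the slowlog, one option is to record the "took" value of every response on the client side and summarize the tail there. Below is a minimal sketch, assuming the "took" values (in ms) have already been collected into a list by the test script. tail_summary and its field names are hypothetical, not part of any ES client.

```python
import math


def tail_summary(took_ms, threshold_ms=50):
    """Summarize per-request "took" values (milliseconds) collected by the client."""
    if not took_ms:
        raise ValueError("took_ms must not be empty")
    ordered = sorted(took_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile: the value at rank ceil(p% of n), 1-based.
        rank = max(1, math.ceil(p / 100 * n))
        return ordered[min(rank, n) - 1]

    # Count requests strictly above the latency threshold.
    slow = sum(1 for t in ordered if t > threshold_ms)
    return {
        "count": n,
        "slow": slow,
        "slow_fraction": slow / n,
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }
```

Comparing the "slow" count from this summary with the number of slowlog entries for the same run shows how many slow requests the slowlog is missing.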

I ran several tests:

  1. different ES versions: 5.6.3 and 7.5.2
  2. different query types: get, query, filter
  3. different numbers of nodes: a one-node cluster and a 10-node cluster
  4. different request rates: 10 qps and 2000 qps

All of the tests show the same issue, so I want to know whether ES simply cannot keep latency low for 100% of queries, what causes the issue, and how to improve it.

5.X is EOL, you should not be using it.

That's going to be impossible unfortunately.

How many records are these queries running against? What is the mapping of the field? How many indices and shards? What does the hot threads output (GET _nodes/hot_threads) look like when you run them? What are your node specs?

There are 400 million documents in the index. Each document has 12 fields, all of type keyword.
The index has 40 shards and is 70GB in size.
The node is a server with 16 vCPUs, 64GB of memory, and one 1TB SSD. CPU usage is 20% during testing.
Server performance is not the bottleneck: even when my script runs very slowly (10 qps), 0.01%-0.1% of the queries still take more than 50ms.

Why do you have 40 shards for an index of 70GB? I would expect 2-3 shards to be more appropriate. How large is your heap? Garbage collection can cause temporary slowdowns but is often faster the smaller the heap is so it is important to set the heap size correctly. Larger is not always better.
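
To put numbers on the shard count question, here is a rough sketch of the arithmetic. The 30GB target shard size is an assumption chosen for illustration, not a figure from this thread, and both function names are hypothetical.

```python
import math


def avg_shard_size_gb(index_size_gb, shard_count):
    """Average primary shard size for an index of the given size."""
    return index_size_gb / shard_count


def shards_for_target(index_size_gb, target_shard_gb):
    """Smallest shard count that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# Current layout from the thread: 70GB spread over 40 shards.
print(avg_shard_size_gb(70, 40))
# Shard count for an assumed target of about 30GB per shard.
print(shards_for_target(70, 30))
```

With 40 shards each shard holds under 2GB, which is why a count of 2-3 looks more appropriate for this index.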

You may also want to test the most recent version as it has switched to G1GC, which could also affect the latency profile.

I then ran another test, using a filter query like this:
{
  "profile": true,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "user_guid": "xxx"
        }
      }
    }
  }
}

I ran this query with the same user_guid 1,600,000 times.
The results of a filter query should hit the query cache, but there are still 233 entries in the slowlog (more than 50ms, some in the query phase and some in the fetch phase), and in total 6926 queries took more than 50ms.

Since you have a single node there will at some point be GC which will affect latencies. Please try the things I suggested and see if it makes any difference.

GC is most likely the cause of the issue.
When I print the GC info with jstat, the young GC frequency is 1-2/s. After I switch to G1GC, or tune the CMS JVM arguments (e.g. -Xmn10g -XX:SurvivorRatio=10 -XX:+UseParNewGC -XX:MaxTenuringThreshold=15), the young GC frequency drops to 0.2/s and the number of queries that take more than 50ms falls to 1/10-1/5 of what it was.
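
For anyone who wants to reproduce the young GC frequency measurement, here is a minimal sketch. It assumes output in the style of `jstat -gcutil <pid> 1000`, that is, a header row containing a YGC column (cumulative young GC count) followed by one row per sampling interval. young_gc_rate is a hypothetical helper, and the sample values are made up for illustration.

```python
def young_gc_rate(jstat_output, interval_s):
    """Estimate young GCs per second from jstat -gcutil style text.

    interval_s is the sampling interval that was passed to jstat, in seconds.
    """
    lines = [ln.split() for ln in jstat_output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    if len(rows) < 2:
        raise ValueError("need at least two samples to compute a rate")
    col = header.index("YGC")  # cumulative young GC count
    first, last = int(rows[0][col]), int(rows[-1][col])
    elapsed_s = interval_s * (len(rows) - 1)
    return (last - first) / elapsed_s


# Made-up sample: three rows taken one second apart.
sample = """
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  41.20  63.10  35.40  95.10  90.20    100    2.310     3    0.450    2.760
 38.70   0.00  12.50  35.60  95.10  90.20    102    2.352     3    0.450    2.802
  0.00  40.10  88.00  35.70  95.10  90.20    103    2.371     3    0.450    2.821
"""
print(young_gc_rate(sample, interval_s=1.0))
```

Running this before and after a JVM settings change gives a like-for-like comparison of the young GC rate under the same query load.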