Bug? Elasticsearch 7 (after upgrade from 6.7) slow search during 3-5 seconds every 60 seconds

DavidTurner · April 17, 2019, 7:32am

Ok, try the following settings:

PUT /_cluster/settings
{
  "persistent": {
    "transport.tracer.include": "indices:data/read/search*",
    "logger.org.elasticsearch.search": "TRACE",
    "logger.org.elasticsearch.tasks.TaskManager": "TRACE",
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}

This should give us some more detailed logs. After you've captured a slow query, you can reset the settings back to their defaults like this:

PUT /_cluster/settings
{
  "persistent": {
    "transport.tracer.include": null,
    "logger.org.elasticsearch.search": null,
    "logger.org.elasticsearch.tasks.TaskManager": null,
    "logger.org.elasticsearch.transport.TransportService.tracer": null
  }
}

Nenad · April 17, 2019, 9:54am

Thank you on your help, I have set the log settings:

{"persistent":{"logger":{"org":{"elasticsearch":{"search":"TRACE","tasks":{"TaskManager":"TRACE"},"transport":{"TransportService":{"tracer":"TRACE"}}}}},"transport":{"tracer":{"include":"indices:data/read/search*"}}},"transient":{}}

To be sure I do this the right way: I have a for loop on shell that execute dump command wait 200ms and repeat, during this I execute search query on shell until I catch when it slows down then I take the dump around this time.

Dump (time in microseconds):
1555493939844626530 https://pastebin.com/z8KhfaBY
1555493940778830070 https://pastebin.com/KHegMV9S
1555493941702770377 https://pastebin.com/BYdd5gBR
1555493942425515017 Slow query "took":1047
1555493942608281975 https://pastebin.com/XSBCtBiH
1555493944420553980 https://pastebin.com/jbEfHwjm

Trace Log: https://pastebin.com/PX4mqb2a

Aside of above, I have monitored network latency between the two nodes this is all the time below 0.5 ms.

I then catched that with 1 node running only, after update of documents is done the first request take longer time (warming), but it took much longer (800ms vs 200ms) as on 6.7.

DavidTurner · April 17, 2019, 11:03am

Right, that log is from node-1 and has these two lines in it:

[2019-04-17T11:39:02,442][TRACE][o.e.t.T.tracer           ] [node-1] [720416][indices:data/read/search[phase/query]] sent to [{node-2}{DKwENWkuQCi7T0hcIZkUfA}{1OkulvjIQhydOsXXY0K-YQ}{node-2-ip}{node-2-ip:9300}{ml.machine_memory=67161415680, ml.max_open_jobs=20, xpack.installed=true}] (timeout: [null])
[2019-04-17T11:39:03,484][TRACE][o.e.t.T.tracer           ] [node-1] [720416][indices:data/read/search[phase/query]] received response from [{node-2}{DKwENWkuQCi7T0hcIZkUfA}{1OkulvjIQhydOsXXY0K-YQ}{node-2-ip}{node-2-ip:9300}{ml.machine_memory=67161415680, ml.max_open_jobs=20, xpack.installed=true}]

This tells us that the query phase of this search took over a second on node-2, from node-1's point of view. Can you share the logs from both nodes so we can see things from node-2's point of view as well?

Nenad · April 17, 2019, 12:10pm

Here is the node-2 Log: https://pastebin.com/dRLPVr9X

DavidTurner · April 17, 2019, 12:26pm

Interesting, thanks. Here's the relevant lines from node-2:

[2019-04-17T11:39:02,442][TRACE][o.e.t.T.tracer           ] [node-2] [720416][indices:data/read/search[phase/query]] received request
[2019-04-17T11:39:02,442][TRACE][o.e.t.TaskManager        ] [node-2] register 14352 [netty] [indices:data/read/search[phase/query]] [shardId[[app1_profile][2]]]
[2019-04-17T11:39:03,479][TRACE][o.e.s.q.QueryPhase       ] [node-2] [app1_profile][2] source[{"from":0,"size":1,"query":{"bool":{"filter":[{"term":{"member_status":{"value":1,"boost":1.0}}},{"terms":{"app_id":["","app1"],"boost":1.0}},{"term":{"country":{"value":"CH","boost":1.0}}},{"term":{"gender":{"value":0,"boost":1.0}}},{"term":{"search_male":{"value":1,"boost":1.0}}},{"range":{"dob":{"from":"now-44y/d","to":"now-20y/d","include_lower":true,"include_upper":true,"boost":1.0}}},{"range":{"search_min_age":{"from":null,"to":37,"include_lower":true,"include_upper":true,"boost":1.0}}},{"range":{"search_max_age":{"from":37,"to":null,"include_lower":true,"include_upper":true,"boost":1.0}}}],"must_not":[{"term":{"is_hidden":{"value":1,"boost":1.0}}},{"term":{"member_id":{"value":1,"boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"profile":true,"_source":{"includes":["hash_id","town","member_id","photo","first_name","gender","dob","last_request_dt","created_dt","geo_distance"],"excludes":[]},"script_fields":{"geo_distance":{"script":{"source":"doc['geo_pt'].arcDistance(params.lat, params.lon)","lang":"painless","params":{"lon":8.53955,"lat":47.37706}},"ignore_failure":false}},"sort":[{"_geo_distance":{"geo_pt":[{"lat":47.37706,"lon":8.53955}],"unit":"km","distance_type":"arc","order":"asc","validation_method":"STRICT","ignore_unmapped":false}},{"last_request_dt":{"order":"desc"}},{"member_id":{"order":"desc"}}]}], id[], 
[2019-04-17T11:39:03,483][TRACE][o.e.t.TaskManager        ] [node-2] unregister task for id: 14352
[2019-04-17T11:39:03,483][TRACE][o.e.t.T.tracer           ] [node-2] [720416][indices:data/read/search[phase/query]] sent response

The time is all spent between receiving the task at 11:39:02,442 and starting to execute it at 11:39:03,479.

The most relevant hot_threads seems to be https://pastebin.com/XSBCtBiH, assuming that your log times are UTC+2h.

   Hot threads at 2019-04-17T09:39:02.612Z, interval=500ms, busiestThreads=999999, ignoreIdleThreads=false:

However there is literally nothing of interest in that thread dump. Bear with me, I'm going to speak to some colleagues.

DavidTurner · April 17, 2019, 12:29pm

Can you provide the gc.log from node-2 around this time?

Nenad · April 17, 2019, 12:40pm

Thank you for explanation.
Yes you're right UTC+2h.

gc.log: https://pasteby.com/DSLzu560mh

DavidTurner · April 17, 2019, 12:55pm

Could you try explicitly setting index.refresh_interval on this index? Doesn't really matter what you set it to, but 1s is the default in 6.7 so you could try that.

Nenad · April 17, 2019, 1:34pm

Awesome! This solved the Issue, thank you very much.

I have set interval of 1s.

Is there anything I should take care because I have set this?

Btw. I have make a shell script and monitored many hundred querys, they took 6-11 ms, rarely one had need 20 - 25 ms.

DavidTurner · April 17, 2019, 1:49pm

Ok, I think I can explain this. The change in 7.0 is this one, which is designed to avoid unnecessary refreshes on shard copies that are rarely searched, to improve indexing performance. However if you perform a search after a shard copy has gone search-idle and you've just indexed some documents then the search will wait until the refresh has completed before proceeding.

Setting the refresh interval to something (anything) prevents the shard from going search-idle. It shouldn't have much effect on an index that only rarely sees any indexing.

I'm still puzzled about two things:

I think you were searching this shard frequently enough to prevent it from going search idle anyway.
I do not understand why we couldn't see a thread performing the refresh.

Not to worry, I can look into these myself.

Nenad · April 17, 2019, 2:05pm

Yes this is really strange, because of 1. even if no user is using the search over the Web UI there are two processes working on production all the time that fire permanently search requests (not below 1 per second), means the two involved index are used 24/7 frequently.
If i can help with any tests on my side to catch the reason for 2. let me know.

s1monw · April 18, 2019, 11:08am

can you try setting "index.search.idle.after" : "10000d" for this index and see what happens? it's a dynamic setting

Nenad · April 18, 2019, 12:53pm

sure, shoud I set before index.refresh_interval = null?

s1monw · April 18, 2019, 12:59pm

you can set it to 1 second. that's fine!

Nenad · April 18, 2019, 1:05pm

/_settings' -H 'Content-Type: application/json'  -d '{
    "index" : {
        "refresh_interval" : null,  
        "search.idle.after" : "10000d"
    }
}'

/_settings' -H 'Content-Type: application/json'  -d '{
    "index" : {
        "refresh_interval" : "1s",
        "search.idle.after" : "10000d"
    }
}'

I have tested both, is working as it should with both settings.

Should I go on 7.0.0 with "index.search.idle.after" : "10000d" or is index.refresh_interval fine too as the set value is enough real time for my use case?

s1monw · April 18, 2019, 1:07pm

both are identical since setting it to null will reset to the default value and that is 1s. To fix your issue you only need to set "search.idle.after" : "10000d"

Nenad · April 18, 2019, 1:10pm

OK well, thank you for clarify.

system · May 16, 2019, 1:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to diagnose slow queries every 10 minutes exactly? Elasticsearch	9	2831	July 5, 2017
ELK stack needs tuning Elasticsearch	6	1238	July 6, 2017
Slower Query Response Times - Intermittent Elasticsearch	7	544	April 30, 2021
ES slowness and calls taking a long time to return Elasticsearch	5	613	July 6, 2017
Debug slow search query Elasticsearch	2	357	November 9, 2022

Bug? Elasticsearch 7 (after upgrade from 6.7) slow search during 3-5 seconds every 60 seconds

Related topics