Bug? Elasticsearch 7 (after upgrade from 6.7) slow search during 3-5 seconds every 60 seconds


(David Turner) #21

Ok, try the following settings:

PUT /_cluster/settings
{
  "persistent": {
    "transport.tracer.include": "indices:data/read/search*",
    "logger.org.elasticsearch.search": "TRACE",
    "logger.org.elasticsearch.tasks.TaskManager": "TRACE",
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}

This should give us some more detailed logs. After you've captured a slow query, you can reset the settings back to their defaults like this:

PUT /_cluster/settings
{
  "persistent": {
    "transport.tracer.include": null,
    "logger.org.elasticsearch.search": null,
    "logger.org.elasticsearch.tasks.TaskManager": null,
    "logger.org.elasticsearch.transport.TransportService.tracer": null
  }
}

#22

Thank you on your help, I have set the log settings:

{"persistent":{"logger":{"org":{"elasticsearch":{"search":"TRACE","tasks":{"TaskManager":"TRACE"},"transport":{"TransportService":{"tracer":"TRACE"}}}}},"transport":{"tracer":{"include":"indices:data/read/search*"}}},"transient":{}}

To be sure I do this the right way: I have a for loop on shell that execute dump command wait 200ms and repeat, during this I execute search query on shell until I catch when it slows down then I take the dump around this time.

Dump (time in microseconds):
1555493939844626530 https://pastebin.com/z8KhfaBY
1555493940778830070 https://pastebin.com/KHegMV9S
1555493941702770377 https://pastebin.com/BYdd5gBR
1555493942425515017 Slow query "took":1047
1555493942608281975 https://pastebin.com/XSBCtBiH
1555493944420553980 https://pastebin.com/jbEfHwjm

Trace Log: https://pastebin.com/PX4mqb2a

Aside of above, I have monitored network latency between the two nodes this is all the time below 0.5 ms.

I then catched that with 1 node running only, after update of documents is done the first request take longer time (warming), but it took much longer (800ms vs 200ms) as on 6.7.


(David Turner) #23

Right, that log is from node-1 and has these two lines in it:

[2019-04-17T11:39:02,442][TRACE][o.e.t.T.tracer           ] [node-1] [720416][indices:data/read/search[phase/query]] sent to [{node-2}{DKwENWkuQCi7T0hcIZkUfA}{1OkulvjIQhydOsXXY0K-YQ}{node-2-ip}{node-2-ip:9300}{ml.machine_memory=67161415680, ml.max_open_jobs=20, xpack.installed=true}] (timeout: [null])
[2019-04-17T11:39:03,484][TRACE][o.e.t.T.tracer           ] [node-1] [720416][indices:data/read/search[phase/query]] received response from [{node-2}{DKwENWkuQCi7T0hcIZkUfA}{1OkulvjIQhydOsXXY0K-YQ}{node-2-ip}{node-2-ip:9300}{ml.machine_memory=67161415680, ml.max_open_jobs=20, xpack.installed=true}]

This tells us that the query phase of this search took over a second on node-2, from node-1's point of view. Can you share the logs from both nodes so we can see things from node-2's point of view as well?


#24

Here is the node-2 Log: https://pastebin.com/dRLPVr9X


(David Turner) #25

Interesting, thanks. Here's the relevant lines from node-2:

[2019-04-17T11:39:02,442][TRACE][o.e.t.T.tracer           ] [node-2] [720416][indices:data/read/search[phase/query]] received request
[2019-04-17T11:39:02,442][TRACE][o.e.t.TaskManager        ] [node-2] register 14352 [netty] [indices:data/read/search[phase/query]] [shardId[[app1_profile][2]]]
[2019-04-17T11:39:03,479][TRACE][o.e.s.q.QueryPhase       ] [node-2] [app1_profile][2] source[{"from":0,"size":1,"query":{"bool":{"filter":[{"term":{"member_status":{"value":1,"boost":1.0}}},{"terms":{"app_id":["","app1"],"boost":1.0}},{"term":{"country":{"value":"CH","boost":1.0}}},{"term":{"gender":{"value":0,"boost":1.0}}},{"term":{"search_male":{"value":1,"boost":1.0}}},{"range":{"dob":{"from":"now-44y/d","to":"now-20y/d","include_lower":true,"include_upper":true,"boost":1.0}}},{"range":{"search_min_age":{"from":null,"to":37,"include_lower":true,"include_upper":true,"boost":1.0}}},{"range":{"search_max_age":{"from":37,"to":null,"include_lower":true,"include_upper":true,"boost":1.0}}}],"must_not":[{"term":{"is_hidden":{"value":1,"boost":1.0}}},{"term":{"member_id":{"value":1,"boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"profile":true,"_source":{"includes":["hash_id","town","member_id","photo","first_name","gender","dob","last_request_dt","created_dt","geo_distance"],"excludes":[]},"script_fields":{"geo_distance":{"script":{"source":"doc['geo_pt'].arcDistance(params.lat, params.lon)","lang":"painless","params":{"lon":8.53955,"lat":47.37706}},"ignore_failure":false}},"sort":[{"_geo_distance":{"geo_pt":[{"lat":47.37706,"lon":8.53955}],"unit":"km","distance_type":"arc","order":"asc","validation_method":"STRICT","ignore_unmapped":false}},{"last_request_dt":{"order":"desc"}},{"member_id":{"order":"desc"}}]}], id[], 
[2019-04-17T11:39:03,483][TRACE][o.e.t.TaskManager        ] [node-2] unregister task for id: 14352
[2019-04-17T11:39:03,483][TRACE][o.e.t.T.tracer           ] [node-2] [720416][indices:data/read/search[phase/query]] sent response

The time is all spent between receiving the task at 11:39:02,442 and starting to execute it at 11:39:03,479.

The most relevant hot_threads seems to be https://pastebin.com/XSBCtBiH, assuming that your log times are UTC+2h.

   Hot threads at 2019-04-17T09:39:02.612Z, interval=500ms, busiestThreads=999999, ignoreIdleThreads=false:

However there is literally nothing of interest in that thread dump. Bear with me, I'm going to speak to some colleagues.


(David Turner) #26

Can you provide the gc.log from node-2 around this time?


#27

Thank you for explanation.
Yes you're right UTC+2h.

gc.log: https://pasteby.com/DSLzu560mh


(David Turner) #28

Could you try explicitly setting index.refresh_interval on this index? Doesn't really matter what you set it to, but 1s is the default in 6.7 so you could try that.


#29

Awesome! This solved the Issue, thank you very much.

I have set interval of 1s.

Is there anything I should take care because I have set this?

Btw. I have make a shell script and monitored many hundred querys, they took 6-11 ms, rarely one had need 20 - 25 ms.


(David Turner) #30

Ok, I think I can explain this. The change in 7.0 is this one, which is designed to avoid unnecessary refreshes on shard copies that are rarely searched, to improve indexing performance. However if you perform a search after a shard copy has gone search-idle and you've just indexed some documents then the search will wait until the refresh has completed before proceeding.

Setting the refresh interval to something (anything) prevents the shard from going search-idle. It shouldn't have much effect on an index that only rarely sees any indexing.

I'm still puzzled about two things:

  1. I think you were searching this shard frequently enough to prevent it from going search idle anyway.

  2. I do not understand why we couldn't see a thread performing the refresh.

Not to worry, I can look into these myself.


#31

Yes this is really strange, because of 1. even if no user is using the search over the Web UI there are two processes working on production all the time that fire permanently search requests (not below 1 per second), means the two involved index are used 24/7 frequently.
If i can help with any tests on my side to catch the reason for 2. let me know.


(Simon Willnauer) #32

can you try setting "index.search.idle.after" : "10000d" for this index and see what happens? it's a dynamic setting


#33

sure, shoud I set before index.refresh_interval = null?


(Simon Willnauer) #34

you can set it to 1 second. that's fine!


#35
/_settings' -H 'Content-Type: application/json'  -d '{
    "index" : {
        "refresh_interval" : null,  
        "search.idle.after" : "10000d"
    }
}'
/_settings' -H 'Content-Type: application/json'  -d '{
    "index" : {
        "refresh_interval" : "1s",
        "search.idle.after" : "10000d"
    }
}'

I have tested both, is working as it should with both settings.

Should I go on 7.0.0 with "index.search.idle.after" : "10000d" or is index.refresh_interval fine too as the set value is enough real time for my use case?


(Simon Willnauer) #36

both are identical since setting it to null will reset to the default value and that is 1s. To fix your issue you only need to set "search.idle.after" : "10000d"


#37

OK well, thank you for clarify.