Hi,
We have the following Elastic server configuration in production:
2 ES server nodes, each running Elastic server and our built-in indexer.
Physical machine details: 6GB RAM
Elastic Server MIN and MAX heap settings: 2GB
Current size of index on production: 56GB
Total number of indexes: 5
Total number of shards: 25 (5 shards per index)
We use Elastic server and the indexer to mine the logs that our various applications create, and we try to provide real-time, detailed statistics on the different APIs that our applications use.
We have a web application 'X' that is a client of Elastic server and sends queries to it. Application X is deployed on 8 Tomcat silos, so at any given time requests from different clients may be hitting this Elastic server box.
Traffic details:
The rate of requests hitting Elastic server is not that high. It is usually highest at the end of the week, i.e. Friday, when users of our app create CSV reports. Apart from that, our web app has a dashboard that loads when an end user logs in to the application. Some components on that dashboard query Elastic server for data, e.g. user historical data (top 10 current user activity entries based on timestamp) or counts. We cache the results for some time in web app 'X', but if the cached data is stale when the dashboard is reloaded, we query Elastic server again for updated stats. (**Cache time is configurable.)
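To make the caching behaviour described above concrete, here is a minimal sketch of a time-based cache with a configurable lifetime. It is not the actual code from web app 'X': the class name `TTLCache`, the method `get_or_load`, and the `loader` callback (which stands in for the call that queries Elastic server) are all hypothetical names used for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache; a hypothetical stand-in for the cache in web app 'X'."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds  # the configurable cache time
        self._clock = clock             # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value while it is fresh, otherwise call loader()."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: no query is sent to the backend
        value = loader()     # stale or missing: this is where the backend is hit
        self._entries[key] = (now, value)
        return value
```

With a cache like this, every dashboard reload that arrives after the lifetime has expired triggers a fresh backend query, so the chosen lifetime directly bounds how often the dashboard queries reach Elastic server.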
Issue:
So the queries that come from the dashboard are the ones that could create a lot of traffic.
For the past few days we have been seeing an issue on our Elastic server backend: some of these queries take a long time and are therefore logged in the slow log. (The slow log threshold we have set in production is 10s.)
These slow queries sometimes take on the order of minutes (10, 20, 40 minutes) and are probably causing other reporting queries to run very slowly and time out.
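For reference, a 10s threshold like the one mentioned above might be expressed roughly as follows. This is only a sketch: the helper `slowlog_settings` is a hypothetical name, the optional lower `info` threshold is an illustrative assumption and not something we have configured, and how the settings are applied (configuration file or index settings update) should be checked against the documentation for the Elasticsearch version in use.

```python
import json
from typing import Dict, Optional


def slowlog_settings(warn: str = "10s", info: Optional[str] = None) -> Dict[str, str]:
    """Build search slow log threshold settings as a flat settings dict.

    Hypothetical helper: only the setting keys themselves are Elasticsearch
    names; the function and its parameters exist for illustration.
    """
    settings = {"index.search.slowlog.threshold.query.warn": warn}
    if info is not None:
        # Assumption: a lower info-level threshold to also catch queries
        # that are slow but finish well under the warn threshold.
        settings["index.search.slowlog.threshold.query.info"] = info
    return settings


if __name__ == "__main__":
    print(json.dumps(slowlog_settings(warn="10s", info="2s"), indent=2))
```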
Is there a way to figure out why the queries are taking so much time? Is there a way to see in the logs how a query actually executes on Elastic server?
Is there a way to find out whether we can optimize the query?
The following query is one of those that get logged in the index slow log; it takes too long to execute and causes other queries to ES to fail.
[2013-01-31 00:17:08,414][-][WARN ][index.search.slowlog.query] [Basilisk] [vip2013][0] took[20m], search_type[DFS_QUERY_THEN_FETCH], total_shards[30], source[{"from":0,"size":10,"query_binary":null,"explain":false,"sort":[{"timeStamp":{"order":"desc"}}]}extra_source[],
In the above query we are trying to order by timestamp. We want to know what the query_binary part is. Why is it null?
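In case it helps anyone looking at the same kind of log, below is a rough sketch of how the fields of a slow log line like the one above could be extracted for analysis. The regular expression is derived only from that single sample line, so it is an assumption about the format rather than a specification, and the names `parse_slowlog_line` and `took_to_seconds` are hypothetical. The duration parser handles only the `ms`, `s`, `m` and `h` suffixes.

```python
import json
import re
from typing import Any, Dict, Optional

# The slow log line from this post, copied verbatim.
SAMPLE_LINE = (
    r'[2013-01-31 00:17:08,414][-][WARN ][index.search.slowlog.query] [Basilisk] '
    r'[vip2013][0] took[20m], search_type[DFS_QUERY_THEN_FETCH], total_shards[30], '
    r'source[{"from":0,"size":10,"query_binary":null,"explain":false,'
    r'"sort":[{"timeStamp":{"order":"desc"}}]}extra_source[],'
)

# Pattern inferred from the sample line only (an assumption, not a spec).
SLOWLOG_RE = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] "
    r"took\[(?P<took>[^\]]+)\], "
    r"search_type\[(?P<search_type>[^\]]+)\], "
    r"total_shards\[(?P<total_shards>\d+)\], "
    r"source\[(?P<source>\{.*\})"
)

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def took_to_seconds(took: str) -> float:
    """Convert a duration such as '20m' or '500ms' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", took)
    if match is None:
        raise ValueError(f"unsupported duration: {took!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_slowlog_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the interesting fields of a slow log line, or None if no match."""
    match = SLOWLOG_RE.search(line)
    if match is None:
        return None
    return {
        "index": match.group("index"),
        "shard": int(match.group("shard")),
        "took_seconds": took_to_seconds(match.group("took")),
        "search_type": match.group("search_type"),
        "total_shards": int(match.group("total_shards")),
        # The logged request body; a JSON null becomes None in Python.
        "source": json.loads(match.group("source")),
    }


if __name__ == "__main__":
    print(parse_slowlog_line(SAMPLE_LINE))
```

Collecting these fields from all slow log entries would at least let us group slow queries by index, shard, search type and request body to see which dashboard components are responsible.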