Kibana - Elasticsearch: Request Timeout after 30000ms

Hi,
Elasticsearch version: 6.1.2, Kibana version: 6.1.2
We are processing logs through Filebeat - Logstash - ES - Kibana
The log file is in CSV format with 23 columns; the total log count is up to 2-3 lakh (200,000-300,000).
We are creating a data table visualization in Kibana using 15 of the 23 columns and trying to display data for 30 days.
With up to 5 columns in the visualization, the response time (latency) reported in the statistics is fine, but as we add more fields the query time to fetch them roughly doubles.
Then warning messages appear: "2 of 5 Shards Failed", "3 of 5 Shards Failed", "4 of 5 Shards Failed", and
finally the error: "Request Timeout after 30000ms".
Steps followed to create the visualization:
Visualize --> New Visualization --> Data Table --> Select Index --> Split Rows (from Buckets) -->
i. Select Aggregation: Terms
ii. Field: Clinic (name of the column)
iii. Size: 5000
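
For context, a single terms split like the one above corresponds roughly to the following Elasticsearch request (a simplified sketch only: the index name your-index and the .keyword suffix are assumptions, and Kibana additionally applies the time-range filter and the chosen metric):

GET your-index/_search
{
  "size": 0,
  "aggs": {
    "clinic_split": {
      "terms": {
        "field": "Clinic.keyword",
        "size": 5000
      }
    }
  }
}

Each additional Split Rows level nests another terms aggregation inside this one, so the number of buckets, and the work Elasticsearch has to do, multiplies with every column added.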


This setup works fine on AWS (cloud machines), but performance degrades when we set it up on staging and production, which are on-premises.

We also face this issue when creating visualizations with more aggregations.

Hey,

It's difficult to give advice on the performance issues you are facing without having further details on how your Elasticsearch infrastructure is architected and built. Could you provide some more info on what setup you are using on both AWS & when building on-premise?

Also, it would be useful to see your elasticsearch & kibana configuration files (please don't forget to remove any secrets though, if you have any!).

Cheers,
Tom

The request timeout can be increased in the kibana.yml file. I use a 120s timeout.
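
For reference, the relevant setting is elasticsearch.requestTimeout in kibana.yml, with the value in milliseconds, so a 120s timeout would look like this (restart Kibana afterwards for it to take effect):

elasticsearch.requestTimeout: 120000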

@anhlqn Yes, we can change the timeout value in the kibana.yml file, but that does not reduce the lag you still see in the visualization. By changing this setting we are able to load the data in the data table visualization, but retrieving data from Elasticsearch in more than 1 minute is not acceptable for end users.

Is there a list of known performance issues for Kibana and Elasticsearch available anywhere? If someone can point one out, that would be really helpful.

We have a similar setup on AWS and on-premises using ELK version 6.1.2.
The yml files are attached below.

==================== Elasticsearch Configuration ===============
#NOTE: Elasticsearch comes with reasonable defaults for most settings.
#Before you set out to tweak and tune the configuration, make sure you understand what you are trying to accomplish and the consequences.
#https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#--------------------------------- Cluster -----------------------------------
#Use a descriptive name for your cluster:
cluster.name: elasticsearch
#----------------------------------- Node ------------------------------------
#Use a descriptive name for the node:
#node.name: node-1
#Add custom attributes to the node:
#node.attr.rack: r1
#----------------------------------- Paths ------------------------------------
#Path to directory where to store the data (separate multiple locations by comma):
#path.data: /path/to/data
#Path to log files:
#path.logs: /path/to/logs
#----------------------------------- Memory -----------------------------------
#Lock the memory on startup:
#bootstrap.memory_lock: true
#Make sure that the heap size is set to about half the memory available
#on the system and that the owner of the process is allowed to use this
#limit.
#Elasticsearch performs poorly when the system is swapping the memory.
#--------------------------------- Network -----------------------------------
#Set the bind address to a specific IP (IPv4 or IPv6):
network.host: 127.0.0.1
#Set a custom port for HTTP:
#http.port: 9200
#For more information, consult the network module documentation.
#-------------------------------- Discovery ----------------------------------
#Pass an initial list of hosts to perform discovery when new node is started:
#The default list of hosts is ["127.0.0.1", "[::1]"]
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#discovery.zen.minimum_master_nodes:
#For more information, consult the zen discovery module documentation.
#---------------------------------- Gateway -----------------------------------
#Block initial recovery after a full cluster restart until N nodes are started:
#gateway.recover_after_nodes: 3
#For more information, consult the gateway module documentation.
#--------------------------------- Various -----------------------------------
#Require explicit names when deleting indices:
#action.destructive_requires_name: true

============Kibana Configuration===============
#Kibana is served by a back end server. This setting specifies the port to use.
#server.port: 5601.
#The default is 'localhost', which usually means remote machines will not be able to connect.
#To allow connections from remote users, set this parameter to a non-loopback address.
server.host: "kibanaOn-prem"
#Enables you to specify a path to mount Kibana at if you are running behind a proxy. This only
#affects the URLs generated by Kibana, your proxy is expected to remove the basePath value
#before forwarding requests to Kibana. This setting cannot end in a slash.
#server.basePath: ""
#The maximum payload size in bytes for incoming server requests.
#server.maxPayloadBytes: 1048576
server.maxPayloadBytes: 5242880

#The Kibana server's name. This is used for display purposes.
#server.name: "your-hostname"
#The URL of the Elasticsearch instance to use for all your queries.
elasticsearch.url: "http://elastichost:9200"
#elasticsearch.preserveHost: true
#Kibana uses an index in Elasticsearch to store saved searches, visualizations and
#dashboards. Kibana creates a new index if the index doesn't already exist.
kibana.index: ".kibana"
#The default application to load.
#kibana.defaultAppId: "home"
#If your Elasticsearch is protected with basic authentication, these settings provide
#the username and password that the Kibana server uses to perform maintenance on the Kibana
#index at startup. Your Kibana users still need to authenticate with Elasticsearch, which
#is proxied through the Kibana server.
#elasticsearch.username: "user"
#elasticsearch.password: "pass"
#Enables SSL and paths to the PEM-format SSL certificate and SSL key files, respectively.
#These settings enable SSL for outgoing requests from the Kibana server to the browser.
#server.ssl.enabled: false
#server.ssl.certificate: /path/to/your/server.crt
#server.ssl.key: /path/to/your/server.key
#Optional settings that provide the paths to the PEM-format SSL certificate and key files.
#These files validate that your Elasticsearch backend uses the same key files.
#elasticsearch.ssl.certificate: /path/to/your/client.crt
#elasticsearch.ssl.key: /path/to/your/client.key
#Optional setting that enables you to specify a path to the PEM file for the certificate
#authority for your Elasticsearch instance.
#elasticsearch.ssl.certificateAuthorities: [ "/path/to/your/CA.pem" ]
#To disregard the validity of SSL certificates, change this setting's value to 'none'.
#elasticsearch.ssl.verificationMode: full
#Time in milliseconds to wait for Elasticsearch to respond to pings. Defaults to the value of
#the elasticsearch.requestTimeout setting.
#elasticsearch.pingTimeout: 1500
#Time in milliseconds to wait for responses from the back end or Elasticsearch. This value
#must be a positive integer.
elasticsearch.requestTimeout: 60000
#List of Kibana client-side headers to send to Elasticsearch. To send no client-side
#headers, set this value to [] (an empty list).
#elasticsearch.requestHeadersWhitelist: [ authorization ]
#Header names and values that are sent to Elasticsearch. Any custom headers cannot be overwritten
#by client-side headers, regardless of the elasticsearch.requestHeadersWhitelist configuration.
#elasticsearch.customHeaders: {}
#Time in milliseconds for Elasticsearch to wait for responses from shards. Set to 0 to disable.
#elasticsearch.shardTimeout: 0
#Time in milliseconds to wait for Elasticsearch at Kibana startup before retrying.
#elasticsearch.startupTimeout: 5000
#Specifies the path where Kibana creates the process ID file.
#pid.file: /var/run/kibana.pid
#Enables you specify a file where Kibana stores log output.
#logging.dest: stdout
#Set the value of this setting to true to suppress all logging output.
#logging.silent: false
#Set the value of this setting to true to suppress all logging output other than error messages.
#logging.quiet: false
#Set the value of this setting to true to log all events, including system usage information
#and all requests.
#logging.verbose: false
#Set the interval in milliseconds to sample system and process performance
#metrics. Minimum is 100ms. Defaults to 5000.
#ops.interval: 5000
#The default locale. This locale can be used in certain circumstances to substitute any missing
#translations.
#i18n.defaultLocale: "en"

If you are creating an aggregation with 15 levels and have the size set to 5000 (like in the screenshot), it will likely use up a lot of memory due to the high bucket count. Depending on the data volumes, it will likely need to read a lot of data from disk and use a fair bit of CPU for processing, so it could get slow. If your disk performance and/or CPU count differs between your cloud and on-prem clusters, that could explain the difference in performance (assuming the data volumes are the same).
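
To put the bucket count in perspective: nested terms aggregations multiply, so if each of the 15 split levels matched even just 10 distinct values, the worst case would already be 10^15 possible bucket combinations. In practice the number of leaf buckets is bounded by the combinations that actually occur in the data (at most one per document), but with a requested size of 5000 per level the node still has to build, hold and sort all of those buckets in heap.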

Do you see any evidence of long GC in the logs? What does CPU usage, disk I/O and iowait look like on the two clusters when running this large aggregation?
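
In case it helps: on a package install the Elasticsearch logs usually sit under /var/log/elasticsearch/ and the main log file is named after the cluster, so something along these lines should surface any long collections reported by the GC monitor (adjust the path and file name to your installation):

grep -i "\[gc\]" /var/log/elasticsearch/*.log | tail -20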

@Christian_Dahlqvist: Thanks for your input. We compared the two servers, i.e. the AWS server and the on-premises server where the ELK stack is installed. There is not much of a difference. Please find the comparison below:
AWS (Elasticsearch server)
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2399.927
BogoMIPS: 4865.52
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,1

On-premises (Elasticsearch server details)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.000
BogoMIPS: 4788.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,1

Note: Apart from "Core(s) per socket" and "Socket(s)", everything else is the same between AWS and on-premises, so we are not sure why we are facing this issue only on-premises.

What about the storage used?

@Christian_Dahlqvist:
Since my query has been modified and now involves more I/O operations, would you be able to advise whether an increase in CPU cores on-premises would help?
Initial query:
On-premises: creating an aggregation with 6 levels works fine.
On-premises: creating an aggregation with 7 levels throws a "Timeout Error".
Workaround:
After increasing the timeout value in the kibana.yml file (from the original 30 seconds to 60 seconds), the data does load, but with a high lag time to display it.

On AWS: creating an aggregation with 15 levels works fine and the response time is less than 30 sec.

Storage details are as below:-
AWS:df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 79G 2.3G 77G 3% /
On-premises
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/dddd-home1 147G 3.0G 137G 3% /export/ii

IOSTAT Details:
AWS
iostat
Linux 4.9.38-16.35.amzn1.x86_64 (ip-172-26-9-22) 04/09/2018 x86_64 (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.12 0.00 0.03 0.01 0.01 99.84
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
xvda 0.41 0.03 13.23 366144 174400704

On-premises
iostat
Linux 3.10.0-327.22.2.el7.x86_64 (msplap471) 04/09/2018 x86_64 (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.47 0.00 0.35 0.04 0.00 99.15
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 2.43 0.09 33.27 1230661 434246155
sdb 0.54 0.02 15.72 215837 205108812
dm-0 2.63 0.06 11.01 813589 143701056
dm-1 0.00 0.00 0.00 1268 4732
dm-2 1.44 0.02 15.72 215333 205108812
dm-3 2.89 0.03 22.26 412541 290540084

Are these iostat outputs from while you were running the expensive query that times out? Can you run iostat -x while the query is executing? Can you also please gather the output of the cluster stats API from both clusters?
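
For reference, roughly like this on each node (default host and port assumed, adjust as needed):

iostat -x 5        # leave this running while the aggregation executes
curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'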

ComparisonClusterStats

@Christian_Dahlqvist: I have added key comparison details w.r.t. the cluster stats of AWS and on-premises.

Can I see the full output?

AWS: Cluster Stats details:

{
"_nodes" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"cluster_name" : "elasticsearch",
"timestamp" : 1523274103134,
"status" : "yellow",
"indices" : {
"count" : 18,
"shards" : {
"total" : 86,
"primaries" : 86,
"replication" : 0.0,
"index" : {
"shards" : {
"min" : 1,
"max" : 5,
"avg" : 4.777777777777778
},
"primaries" : {
"min" : 1,
"max" : 5,
"avg" : 4.777777777777778
},
"replication" : {
"min" : 0.0,
"max" : 0.0,
"avg" : 0.0
}
}
},
"docs" : {
"count" : 656861,
"deleted" : 1
},
"store" : {
"size_in_bytes" : 454650168
},
"fielddata" : {
"memory_size_in_bytes" : 66696,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 120293,
"total_count" : 1216,
"hit_count" : 530,
"miss_count" : 686,
"cache_size" : 56,
"cache_count" : 71,
"evictions" : 15
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 481,
"memory_in_bytes" : 10253639,
"terms_memory_in_bytes" : 8661297,
"stored_fields_memory_in_bytes" : 383408,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 755392,
"points_memory_in_bytes" : 17554,
"doc_values_memory_in_bytes" : 435988,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1523012904918,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 1,
"data" : 1,
"coordinating_only" : 0,
"master" : 1,
"ingest" : 1
},
"versions" : [
"6.1.2"
],
"os" : {
"available_processors" : 2,
"allocated_processors" : 2,
"names" : [
{
"name" : "Linux",
"count" : 1
}
],
"mem" : {
"total_in_bytes" : 8373010432,
"free_in_bytes" : 2036281344,
"used_in_bytes" : 6336729088,
"free_percent" : 24,
"used_percent" : 76
}
},
"process" : {
"cpu" : {
"percent" : 0
},
"open_file_descriptors" : {
"min" : 313,
"max" : 313,
"avg" : 313
}
},
"jvm" : {
"max_uptime_in_millis" : 953604446,
"versions" : [
{
"version" : "1.8.0_151",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.151-b12",
"vm_vendor" : "Oracle Corporation",
"count" : 1
}
],
"mem" : {
"heap_used_in_bytes" : 1569732080,
"heap_max_in_bytes" : 4277534720
},
"threads" : 43
},
"fs" : {
"total_in_bytes" : 84415246336,
"free_in_bytes" : 81993969664,
"available_in_bytes" : 81891315712
},
"plugins" : [ ],
"network_types" : {
"transport_types" : {
"netty4" : 1
},
"http_types" : {
"netty4" : 1
}
}
}
}

In this cluster you have 18 indices. How many of these does the query cover? Is the number of indices and data volume the same in the other cluster?
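
A quick way to check that is the cat indices API on each cluster (default host and port assumed):

curl -XGET 'http://localhost:9200/_cat/indices?v'

It lists every index with its document count and store size, so you can compare the index the query targets directly between the two clusters.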

ON-premises Cluster Stats Details:
{
"_nodes" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"cluster_name" : "STG",
"timestamp" : 1523252657835,
"status" : "yellow",
"indices" : {
"count" : 82,
"shards" : {
"total" : 406,
"primaries" : 406,
"replication" : 0.0,
"index" : {
"shards" : {
"min" : 1,
"max" : 5,
"avg" : 4.951219512195122
},
"primaries" : {
"min" : 1,
"max" : 5,
"avg" : 4.951219512195122
},
"replication" : {
"min" : 0.0,
"max" : 0.0,
"avg" : 0.0
}
}
},
"docs" : {
"count" : 7099376,
"deleted" : 7
},
"store" : {
"size_in_bytes" : 2816635232
},
"fielddata" : {
"memory_size_in_bytes" : 58048,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 0,
"total_count" : 635,
"hit_count" : 22,
"miss_count" : 613,
"cache_size" : 0,
"cache_count" : 12,
"evictions" : 12
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 2200,
"memory_in_bytes" : 21040374,
"terms_memory_in_bytes" : 17139209,
"stored_fields_memory_in_bytes" : 1628520,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 1559552,
"points_memory_in_bytes" : 127269,
"doc_values_memory_in_bytes" : 585824,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1523145617618,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 1,
"data" : 1,
"coordinating_only" : 0,
"master" : 1,
"ingest" : 1
},
"versions" : [
"6.1.2"
],
"os" : {
"available_processors" : 2,
"allocated_processors" : 2,
"names" : [
{
"name" : "Linux",
"count" : 1
}
],
"mem" : {
"total_in_bytes" : 16659423232,
"free_in_bytes" : 831090688,
"used_in_bytes" : 15828332544,
"free_percent" : 5,
"used_percent" : 95
}
},
"process" : {
"cpu" : {
"percent" : 1
},
"open_file_descriptors" : {
"min" : 994,
"max" : 994,
"avg" : 994
}
},
"jvm" : {
"max_uptime_in_millis" : 213671922,
"versions" : [
{
"version" : "1.8.0_131",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.131-b11",
"vm_vendor" : "Oracle Corporation",
"count" : 1
}
],
"mem" : {
"heap_used_in_bytes" : 3160691776,
"heap_max_in_bytes" : 8572502016
},
"threads" : 42
},
"fs" : {
"total_in_bytes" : 157342515200,
"free_in_bytes" : 154150662144,
"available_in_bytes" : 146134511616
},
"plugins" : [ ],
"network_types" : {
"transport_types" : {
"netty4" : 1
},
"http_types" : {
"netty4" : 1
}
}
}
}

@Christian_Dahlqvist: No, the query covers only 1 index; the rest of the indices were created as part of other Logstash ingestion jobs.

And this index has the same size and structure in both clusters?
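
For example, something along these lines on both clusters would show it (my-index is only a placeholder for the index the data table targets):

curl -XGET 'http://localhost:9200/_cat/indices/my-index?v'
curl -XGET 'http://localhost:9200/my-index/_mapping?pretty'

The first shows the document count and store size, the second the field mappings, so both size and structure can be compared directly.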