Hi everyone,
We are running an ELK 7.1 deployment on IKS (IBM Cloud Kubernetes Service).
Our setup is the following:
Infrastructure:
-IKS Cluster running Kubernetes 1.14.8_1536
-2x worker nodes of type Virtual Shared b2c.4x16 - 4 vCPUs, 16 GB RAM
-3x 20GB Block Storage with 1000 IOPS - for the elastic-master nodes
-2x 1000GB Block Storage with 1000 IOPS - for the elastic-data nodes
ELK 7.1 deployment with the Elastic Helm charts, main resource settings:
-3x elastic-master nodes
resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"
esJavaOpts: "-Xmx1500m -Xms1500m"
-2x elastic-data nodes
resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"
esJavaOpts: "-Xmx1500m -Xms1500m"
-1x kibana node
resources:
  requests:
    cpu: "100m"
    memory: "500m"
  limits:
    cpu: "1000m"
    memory: "1Gi"
-1x logstash node
no resources specified
logstashJavaOpts: "-Xmx1g -Xms1g"
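Would it make sense to pin resources for Logstash as well, like we do for the other components? A minimal values.yaml sketch of what that could look like (the memory figures are just placeholders, assuming the elastic/helm-charts logstash chart):

resources:
  requests:
    cpu: "100m"
    memory: "1536Mi"   # placeholder value, sized to leave headroom over the 1g heap
  limits:
    cpu: "1000m"
    memory: "1536Mi"
# heap stays as we have it today
logstashJavaOpts: "-Xmx1g -Xms1g"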
We are forwarding syslog-type logs with Fluentd from the IKS cluster to Logstash.
Our index size per day is approx. 15-20 GB; as you can see, we have 1 primary and 1 replica for every index:
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-logs-2019.11.09 zqaTt02OTHCNj01TxyJXxA   1   1   30352029            0     30.5gb         15.2gb
green  open   logstash-logs-2019.11.08 lhLtQrvQTJ-eaGDH6T-Ogg   1   1   19440417            0     19.4gb          9.7gb
green  open   logstash-logs-2019.11.07 m5jGrS4kQ5-6H_DySNbuXw   1   1   29759146            0     29.3gb         14.6gb
green  open   logstash-logs-2019.11.06 Se608LsyQbKWt7llaxXlWQ   1   1   21048425            0     21.3gb         10.6gb
green  open   logstash-logs-2019.11.05 LzqSTVxETnCGfUdcySuoIQ   1   1   26868532            0     26.7gb         13.3gb
We are forwarding logs from our app, which runs on the 3 app nodes, to our ELK stack, which runs on the 2 ELK nodes. The allocation to specific nodes is done with the nodeSelector option at deployment level.
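The pinning itself is just a nodeSelector entry in each chart's values, roughly like this (a minimal sketch; the label name is only illustrative, not our exact one):

nodeSelector:
  # example label applied to the two ELK worker nodes
  node-role: elk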
In our opinion this is currently not working so well. For example:
NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
app_node1   1469m        37%    10077Mi         75%
app_node2   3317m        84%    10636Mi         79%
app_node3   1464m        37%    8937Mi          67%
elk_node1   1632m        41%    11317Mi         84%
elk_node2   1443m        36%    10703Mi         80%
At the above load, when I access Kibana, set the time range to the last 1 hour and do a simple search for a keyword, it displays the "Discover: Request Timeout after 30000ms" error message on the dashboard. It is practically unusable: very slow, frozen, you cannot do much with it.
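If we understand it correctly, the 30000ms is Kibana's default elasticsearch.requestTimeout, which could be raised through the chart's kibanaConfig value, roughly like below, but that would presumably only hide the underlying slowness rather than fix it:

kibanaConfig:
  kibana.yml: |
    # raise the request timeout from the 30s default - a workaround, not a fix
    elasticsearch.requestTimeout: 60000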
We also looked at the JVM heap size settings/values (the heap.max of 1.4gb shown further below is consistent with the -Xmx1500m setting):
/_cat/nodes?v&h=id,disk.total,disk.used,disk.avail,disk.used_percent,ram.current,ram.percent,ram.max,cpu
172.30.205.51            8          72  22    1.88    1.88     2.06 m         -      elasticsearch-master-2
172.30.19.79             6          93  26    3.25    2.60     2.52 m         -      elasticsearch-master-0
172.30.74.142           78          97  32    4.86    5.42     5.74 d         -      elasticsearch-data-1
172.30.19.80            75          93  26    3.25    2.60     2.52 d         -      elasticsearch-data-0
172.30.74.141            6          97  33    4.86    5.42     5.74 m         *      elasticsearch-master-1
_cat/nodes?h=heap.max
1.4gb
1.4gb
1.4gb
1.4gb
1.4gb
Please let us know if the foundation blocks of this ELK setup are properly laid. Are these resources enough for our needs? Are the resource allocation settings in the deployment correct? What should we change at this level so we can move on to the next step, tuning the ELK deployment? (But that only after the sizing is done properly, I guess.)
Looking forward to your answer.
Thank you,
Zoltan