ELK resource requirements

Hi everyone,

We are running an ELK 7.1 deployment on IKS (IBM Cloud Kubernetes Service).
Our setup is the following:

Infrastructure:
-IKS cluster running Kubernetes 1.14.8_1536
-2x worker nodes of type Virtual Shared b2c.4x16 - 4 vCPUs, 16GB RAM
-3x 20GB BlockStorage with 1000 IOPS - for the elastic-master nodes
-2x 1000GB BlockStorage with 1000 IOPS - for the elastic-data nodes

ELK 7.1 deployment with Helm charts, main resource settings:

-3x elastic-master nodes

resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

esJavaOpts: "-Xmx1500m -Xms1500m"

-2x elastic-data nodes

resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

esJavaOpts: "-Xmx1500m -Xms1500m"

-1x kibana node

resources:
  requests:
    cpu: "100m"
    memory: "500m"
  limits:
    cpu: "1000m"
    memory: "1Gi"

-1x logstash node
no resources specified
logstashJavaOpts: "-Xmx1g -Xms1g"
We are forwarding syslog-type logs with fluentd from the IKS cluster to logstash.

Our index size per day is approximately 15GB-20GB. As you can see, we have 1 primary and 1 replica for every index:

health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-logs-2019.11.09 zqaTt02OTHCNj01TxyJXxA   1   1   30352029            0     30.5gb         15.2gb
green  open   logstash-logs-2019.11.08 lhLtQrvQTJ-eaGDH6T-Ogg   1   1   19440417            0     19.4gb          9.7gb
green  open   logstash-logs-2019.11.07 m5jGrS4kQ5-6H_DySNbuXw   1   1   29759146            0     29.3gb         14.6gb
green  open   logstash-logs-2019.11.06 Se608LsyQbKWt7llaxXlWQ   1   1   21048425            0     21.3gb         10.6gb
green  open   logstash-logs-2019.11.05 LzqSTVxETnCGfUdcySuoIQ   1   1   26868532            0     26.7gb         13.3gb

We are forwarding logs from our app, which runs on the 3 app_nodes, to our ELK, which runs on the 2 elk_nodes. Pods are pinned to specific nodes with the nodeSelector option at deployment level.

In our opinion this is currently not working very well. For example:

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
app_node1   1469m        37%    10077Mi         75%
app_node2   3317m        84%    10636Mi         79%
app_node3   1464m        37%    8937Mi          67%
elk_node1   1632m        41%    11317Mi         84%
elk_node2   1443m        36%    10703Mi         80%

At the above load, when I open Kibana, set the time range to the last 1 hour, and run a simple keyword search, the dashboard displays the error "Discover: Request Timeout after 30000ms". Kibana is practically unusable: it is very slow and freezes.

I tried to look at the JVM heap size settings/values:

/_cat/nodes?v&h=id,disk.total,disk.used,disk.avail,disk.used_percent,ram.current,ram.percent,ram.max,cpu

172.30.205.51            8          72  22    1.88    1.88     2.06 m         -      elasticsearch-master-2
172.30.19.79             6          93  26    3.25    2.60     2.52 m         -      elasticsearch-master-0
172.30.74.142           78          97  32    4.86    5.42     5.74 d         -      elasticsearch-data-1
172.30.19.80            75          93  26    3.25    2.60     2.52 d         -      elasticsearch-data-0
172.30.74.141            6          97  33    4.86    5.42     5.74 m         *      elasticsearch-master-1

_cat/nodes?h=heap.max

1.4gb
1.4gb
1.4gb
1.4gb
1.4gb

Please let us know whether the foundation of this ELK setup is sound. Are these resources enough for our needs? Are the resource allocation settings in the deployment correct? What should we change at this level before we move on to tuning the ELK deployment itself (which, I assume, should only happen once the sizing is right)?

Looking forward to your answer.

Thank you,
Zoltan


The specification for the dedicated master nodes looks OK, although heap should only be set to 50% of allocated RAM. For the data nodes I think you have assigned far too little RAM, heap and CPU. Here too, heap should be 50% of available RAM. If you intend to fill up the available terabyte of storage, I would target around a 1:32 RAM-to-storage ratio, e.g. 32GB RAM and 16GB heap, together with 4 cores. If you follow best practices and do not have heavy querying, you may very well get away with less than that, though.
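
As a rough sketch of that sizing, a data node block in the same Helm values format as the original post might look like the following. The `resources` and `esJavaOpts` keys are the ones shown in the original post; setting requests equal to limits and requesting the full 4 cores are assumptions for illustration, not something stated in this thread.

```yaml
# Sketch only: data node sizing for roughly 1TB of storage per node
# at a 1:32 RAM-to-storage ratio (1TB / 32 = 32GB RAM).
resources:
  requests:
    cpu: "4"          # assumption: request the cores up front instead of 100m
    memory: "32Gi"    # assumption: requests set equal to limits
  limits:
    cpu: "4"
    memory: "32Gi"

# Heap at 50% of the container memory; the remainder is left
# for off-heap use and the operating system's filesystem cache.
esJavaOpts: "-Xmx16g -Xms16g"
```

Note that a 32Gi pod would not fit on the 16GB worker nodes described in the original post, so larger workers would be needed for anything close to this.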

Hi @Christian_Dahlqvist,

I replaced the 2 worker nodes with 2x 8 vCPU, 32GB RAM worker nodes.
Changed the elastic-master nodes' JVM heap to 1GB (50% of the 2GB limit).
Allocated 24GB RAM to each elastic-data node, with 12GB JVM heap (50%).
Logstash and Kibana were left untouched.
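
For reference, in the values format from the first post these changes would look roughly like the sketch below. CPU settings are omitted because they are not restated here, and applying the 24GB to both requests and limits is an assumption.

```yaml
# elastic-master nodes: memory limit unchanged at 2Gi,
# heap lowered from 1500m to 50% of the limit.
resources:
  limits:
    memory: "2Gi"
esJavaOpts: "-Xmx1g -Xms1g"
---
# elastic-data nodes: 24GB of RAM per node, heap at 50% of that.
resources:
  requests:
    memory: "24Gi"   # assumption: same value for requests and limits
  limits:
    memory: "24Gi"
esJavaOpts: "-Xmx12g -Xms12g"
```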

I re-deployed the whole ELK stack, but unfortunately there is no visible improvement.
On the Discover tab in Kibana, when I set the time range to Today or to the last 1 hour and run a simple keyword search, I still get the same "Discover: Request Timeout after 30000ms" error on the dashboard. It is practically unusable: very slow and frozen.

Please advise on this.
Thank you,
Zoltan

Hi again,

In our tests today, we observed that the Kibana "Discover: Request Timeout" error mainly appears when we search for keywords in the current index for today. We ran some searches against past indices without errors, so this is definitely an improvement. (We are also considering increasing Kibana's request timeout, but I don't see this as a fix, more as a workaround.)
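
The timeout workaround would be a one-line change in kibana.yml. A minimal sketch, assuming the 30000ms in the error message corresponds to Kibana's `elasticsearch.requestTimeout` setting, and using 60000 purely as an illustrative value:

```yaml
# kibana.yml
# Time in milliseconds to wait for responses from Elasticsearch
# (the default is 30000, which matches the error message).
elasticsearch.requestTimeout: 60000   # illustrative value, not a recommendation
```

This would only make Kibana wait longer for a slow search; it does not make the search itself any faster.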

We suspect this is related to data being written to and read from the current index at the same time. If so, how can we tune this, or which settings should we change?

For us, the essential purpose of this ELK deployment is to see current logs and to search the current index quickly, so that we can mitigate an application failure or error that is happening now or happened in the past couple of hours.

@Christian_Dahlqvist Please see my replies and share your thoughts on this.
@DavidTurner Could you take a look at this and give us your advice? It would be much appreciated.

Thank you,
Zoltan