Performance and Sizing Help and Insights

Question:
We are looking for guidance or commentary on our OpenDistro cluster, which is deployed on our Kubernetes infrastructure via Helm. I have noted the current configuration of the cluster below.
While our cluster seems stable, slight bumps in the night quickly make it unstable. We recently had an issue where two (2) of the client containers were destroyed and recreated. This caused the masters to disconnect and the cluster to become unstable and go yellow. Several shards had to be reinitialized, which took ~36 hours to complete. During the yellow period we saw the ingest rate drop by ~20%.
On previous occasions we noticed that garbage collection on the clients was impacting the ingest rate: it dropped by 10-15% per day until the client containers were destroyed and redeployed (individually and serially).
We feel as though we are at the edge of a cliff with the current configuration of the cluster: the slightest wind tips it over and the cluster goes yellow or red.
We are hoping that others can provide some insight into the next steps toward further stabilizing the cluster and enabling solid scaling for ingest. We would like to stop shooting in the dark about where in the cluster to devote resources and attention. There do not seem to be any examples online of this level of ingest on Kubernetes.

Version:

  • OD: 1.1.0
  • ES: 7.1.1

Data Type:

  • Syslog

Index Configuration:

  • Single daily index
    • Shards: 10
    • Replica: 1
  • Ingest Rate
    • Documents per Day: ~6,800,000,000
    • Size per Day: 1.8TB
  • Retention
    • Indices closed after: 7 days
    • Indices deleted after: 30 days
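
For scale, a back-of-the-envelope check on the figures above (plain Python, no cluster access needed) puts the average ingest near 80,000 documents per second and the primary shards around 180GB each:

```python
# Rough sizing from the numbers above.
docs_per_day = 6_800_000_000
primary_tb_per_day = 1.8
primary_shards = 10

print("avg docs/sec:", round(docs_per_day / 86_400))                        # ~78,700
print("GB per primary shard:", primary_tb_per_day * 1000 / primary_shards)  # ~180
```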

Architecture:

  1. Data:
    • Number: 10
    • CPU: 4
    • Memory: 32G
    • Heap: 16G
  2. Master:
    • Number: 5
    • CPU: 4
    • Memory: 16G
    • Heap: 8G
  3. Client:
    • Number: 5
    • CPU: 2
    • Memory: 8G
    • Heap: 4G

Storage:

  • Storage backend:
    • NFS mounts to data nodes
    • Disks: 7.2K RPM

Garbage Collection:

  • JAVA_OPTS: "-XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=75"
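
As a sanity check, one way to confirm these flags (and the heap sizes) are actually in effect on every node is to ask the nodes for their JVM input arguments. A minimal sketch with the Python client; the hostname is a placeholder for your client/coordinating service:

```python
from elasticsearch import Elasticsearch

# Hostname is a placeholder for your client/coordinating service.
es = Elasticsearch(["http://elasticsearch-client:9200"])

# _nodes/jvm reports the arguments each JVM actually started with.
info = es.nodes.info(metric="jvm")
for node in info["nodes"].values():
    print(node["name"], node["jvm"]["input_arguments"])
```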

I have a few questions about the numbers you presented:

  • Is the 1.8TB of data indexed per day the raw document size, the primary index size on disk, or the total replicated index size on disk?
  • What is your average shard size?
  • Do you have monitoring installed? If so, what does the heap usage on the data nodes look like?
  • Is there anything in the logs that indicate long and/or frequent GC?

@Christian_Dahlqvist

  1. 1.8TB is the primary index size on disk.
  2. The average shard size is 180GB.
  3. We use Cerebro for basic monitoring and insight into the cluster. We also have a limited Grafana dashboard, which we currently use mostly to monitor GC.
    3a. I would appreciate some insight into KPIs or performance metrics we should add to the Grafana dashboard to monitor cluster health (see the sketch after this list).
  4. See number 3.
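
As a starting point for 3a, here is a minimal sketch (Python client; the hostname is a placeholder) of the per-node numbers that are usually worth graphing: heap usage, old-generation GC activity, and write thread pool rejections, which show ingest back-pressure directly.

```python
from elasticsearch import Elasticsearch

# Hostname is a placeholder for your client/coordinating service.
es = Elasticsearch(["http://elasticsearch-client:9200"])

# Per-node heap, old-gen GC and write-pool rejections -- the numbers that
# most directly explain a falling ingest rate.
stats = es.nodes.stats(metric="jvm,thread_pool")
for node in stats["nodes"].values():
    jvm = node["jvm"]
    print(
        node["name"],
        "heap%:", jvm["mem"]["heap_used_percent"],                       # sustained >75% = heap pressure
        "old GCs:", jvm["gc"]["collectors"]["old"]["collection_count"],  # old-gen GCs should be rare
        "write rejections:", node["thread_pool"]["write"]["rejected"],   # >0 = ingest back-pressure
    )
```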

180GB shards are quite large. It is often recommended to keep shard sizes around 50GB, primarily to speed up recovery and shard relocation. I would therefore probably recommend switching to the rollover API so you can set a target index size and roll over to a new index when required.
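
A minimal sketch of that rollover approach (Python client; the hostname, index and alias names are placeholders, and the size condition assumes you keep 10 primary shards, i.e. roughly 50GB each):

```python
from elasticsearch import Elasticsearch

# Hostname, index and alias names are placeholders.
es = Elasticsearch(["http://elasticsearch-client:9200"])

# One-time bootstrap: the first index in the series, with a write alias.
es.indices.create(index="syslog-000001", body={"aliases": {"syslog-write": {}}})

# Run this periodically (e.g. from a Kubernetes CronJob). A new index is only
# created when a condition is met; 500gb of primaries across 10 shards keeps
# each shard near the recommended ~50GB.
resp = es.indices.rollover(
    alias="syslog-write",
    body={"conditions": {"max_size": "500gb", "max_age": "1d"}},
)
print(resp["rolled_over"], resp.get("new_index"))
```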

If I calculate correctly, that means each node holds about 2.5TB of open indices (over 10TB including closed), which is quite a lot for a node that is indexing heavily given the heap size you have specified. I would probably recommend scaling the data nodes up or out. Given such a short retention period, force merging down to a single segment to save heap space might not be worth it.
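
For reference, the arithmetic behind that estimate, plus a way to compare it against what each node actually reports (Python client; the hostname is a placeholder):

```python
from elasticsearch import Elasticsearch

# Hostname is a placeholder for your client/coordinating service.
es = Elasticsearch(["http://elasticsearch-client:9200"])

# 1.8TB of primaries/day with 1 replica = 3.6TB/day; 7 days open across
# 10 data nodes ~= 2.5TB/node, and ~10.8TB/node over the 30-day retention.
daily_tb = 1.8 * 2
print("open TB per node:", daily_tb * 7 / 10)
print("total TB per node:", daily_tb * 30 / 10)

# Compare with what the cluster reports per data node.
print(es.cat.allocation(v=True, h="node,shards,disk.indices,disk.used,disk.percent"))
```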

I would recommend you have a look at the following resources:
