Implementing the ELK stack in a large company with huge data volumes

Hello

We want to change how the ELK stack is implemented in our company. We are currently facing some challenges:

  • Kibana is too slow (sometimes a response from Elasticsearch takes more than 5 minutes)
  • Our services' logs are not real-time when load increases: logs arrive in Elasticsearch with a delay

Our ELK structure is:

  • 2 Data Nodes
  • 3 Master Nodes
  • 1 Coordinate Node
  • 3 Logstash
  • 1 Kibana

Service logs are shipped by Filebeat to Logstash and then sent to the Elasticsearch data nodes.
We ingest 2TB of logs every day.
I have read most of the documentation on the internet and the Elastic docs, so my question is: what is the best disk-to-memory ratio and the best role selection for nodes (Hot, Warm, Cold, Content)? Do we need to add new nodes to our structure?


Welcome to our community! :smiley:

What is the output from the _cluster/stats?pretty&human API?


Output:

{
    "_nodes": {
        "total": 7,
        "successful": 7,
        "failed": 0
    },
    "cluster_name": "logger",
    "cluster_uuid": "SbFVUzaQJaA7Bplz",
    "timestamp": 1618290359927,
    "status": "green",
    "indices": {
        "count": 78,
        "shards": {
            "total": 1038,
            "primaries": 994,
            "replication": 0.04426559356136821,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 40,
                    "avg": 13.307692307692308
                },
                "primaries": {
                    "min": 1,
                    "max": 40,
                    "avg": 12.743589743589743
                },
                "replication": {
                    "min": 0.0,
                    "max": 1.0,
                    "avg": 0.5512820512820513
                }
            }
        },
        "docs": {
            "count": 3613873613,
            "deleted": 1424413
        },
        "store": {
            "size": "17.4tb",
            "size_in_bytes": 19217875617643,
            "reserved": "0b",
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size": "11.6gb",
            "memory_size_in_bytes": 12456800872,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "2.3gb",
            "memory_size_in_bytes": 2519396883,
            "total_count": 325818879,
            "hit_count": 28361821,
            "miss_count": 297457058,
            "cache_size": 354447,
            "cache_count": 574446,
            "evictions": 219999
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 26723,
            "memory": "684.3mb",
            "memory_in_bytes": 717631112,
            "terms_memory": "508.9mb",
            "terms_memory_in_bytes": 533719696,
            "stored_fields_memory": "61.3mb",
            "stored_fields_memory_in_bytes": 64307720,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "63.8mb",
            "norms_memory_in_bytes": 66952640,
            "points_memory": "0b",
            "points_memory_in_bytes": 0,
            "doc_values_memory": "50.2mb",
            "doc_values_memory_in_bytes": 52651056,
            "index_writer_memory": "206.2mb",
            "index_writer_memory_in_bytes": 216264324,
            "version_map_memory": "59.4kb",
            "version_map_memory_in_bytes": 60862,
            "fixed_bit_set": "1.5mb",
            "fixed_bit_set_memory_in_bytes": 1657296,
            "max_unsafe_auto_id_timestamp": 1618272002781,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "binary",
                    "count": 62,
                    "index_count": 9
                },
                {
                    "name": "boolean",
                    "count": 331,
                    "index_count": 65
                },
                {
                    "name": "byte",
                    "count": 2,
                    "index_count": 2
                },
                {
                    "name": "date",
                    "count": 555,
                    "index_count": 77
                },
                {
                    "name": "date_nanos",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "date_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "double",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "double_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "flattened",
                    "count": 55,
                    "index_count": 7
                },
                {
                    "name": "float",
                    "count": 192,
                    "index_count": 49
                },
                {
                    "name": "float_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "geo_point",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "geo_shape",
                    "count": 2,
                    "index_count": 2
                },
                {
                    "name": "half_float",
                    "count": 57,
                    "index_count": 15
                },
                {
                    "name": "integer",
                    "count": 384,
                    "index_count": 52
                },
                {
                    "name": "integer_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "ip",
                    "count": 69,
                    "index_count": 35
                },
                {
                    "name": "ip_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "keyword",
                    "count": 5235,
                    "index_count": 75
                },
                {
                    "name": "long",
                    "count": 1506,
                    "index_count": 66
                },
                {
                    "name": "long_range",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "nested",
                    "count": 96,
                    "index_count": 17
                },
                {
                    "name": "object",
                    "count": 2089,
                    "index_count": 74
                },
                {
                    "name": "scaled_float",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "shape",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "short",
                    "count": 38,
                    "index_count": 37
                },
                {
                    "name": "text",
                    "count": 3140,
                    "index_count": 61
                }
            ]
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [
                {
                    "name": "pattern_capture",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "analyzer_types": [
                {
                    "name": "custom",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_char_filters": [],
            "built_in_tokenizers": [
                {
                    "name": "uax_url_email",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_filters": [
                {
                    "name": "lowercase",
                    "count": 1,
                    "index_count": 1
                },
                {
                    "name": "unique",
                    "count": 1,
                    "index_count": 1
                }
            ],
            "built_in_analyzers": []
        }
    },
    "nodes": {
        "count": {
            "total": 7,
            "coordinating_only": 2,
            "data": 2,
            "ingest": 2,
            "master": 3,
            "ml": 0,
            "remote_cluster_client": 1,
            "transform": 2,
            "voting_only": 0
        },
        "versions": [
            "7.9.1"
        ],
        "os": {
            "available_processors": 42,
            "allocated_processors": 42,
            "names": [
                {
                    "name": "Linux",
                    "count": 7
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 16.04.7 LTS",
                    "count": 6
                },
                {
                    "pretty_name": "RHEL",
                    "count": 1
                }
            ],
            "mem": {
                "total": "117.7gb",
                "total_in_bytes": 126442561536,
                "free": "4.2gb",
                "free_in_bytes": 4599955456,
                "used": "113.4gb",
                "used_in_bytes": 121842606080,
                "free_percent": 4,
                "used_percent": 96
            }
        },
        "process": {
            "cpu": {
                "percent": 102
            },
            "open_file_descriptors": {
                "min": 385,
                "max": 7423,
                "avg": 2395
            }
        },
        "jvm": {
            "max_uptime": "8.6d",
            "max_uptime_in_millis": 749036439,
            "versions": [
                {
                    "version": "14.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "14.0.1+7",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 7
                }
            ],
            "mem": {
                "heap_used": "36.3gb",
                "heap_used_in_bytes": 39001742696,
                "heap_max": "55gb",
                "heap_max_in_bytes": 59055800320
            },
            "threads": 450
        },
        "fs": {
            "total": "19.8tb",
            "total_in_bytes": 21834422001664,
            "free": "2.3tb",
            "free_in_bytes": 2592108277760,
            "available": "1.4tb",
            "available_in_bytes": 1636765593600
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 7
            },
            "http_types": {
                "security4": 7
            }
        },
        "discovery_types": {
            "zen": 7
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "rpm",
                "count": 1
            },
            {
                "flavor": "default",
                "type": "deb",
                "count": 6
            }
        ],
        "ingest": {
            "number_of_pipelines": 3,
            "processor_stats": {
                "date": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "date_index_name": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "geoip": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "json": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "remove": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time": "0s",
                    "time_in_millis": 0
                }
            }
        }
    }
}
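
A few headline ratios can be read straight out of the output above. The following is only a minimal sketch: the helper name `summarize` is hypothetical, and the dictionary holds just the handful of fields it needs, copied from the output. Note that the heap and filesystem figures in `_cluster/stats` are totals across all 7 nodes, not per-node values.

```python
# Subset of the _cluster/stats output above (values copied verbatim).
stats = {
    "indices": {"shards": {"total": 1038, "primaries": 994}},
    "nodes": {
        "count": {"data": 2},
        "jvm": {"mem": {"heap_used_in_bytes": 39001742696,
                        "heap_max_in_bytes": 59055800320}},
        "fs": {"total_in_bytes": 21834422001664,
               "available_in_bytes": 1636765593600},
    },
}


def summarize(stats):
    """Hypothetical helper: derive a few ratios from a _cluster/stats dict."""
    shards = stats["indices"]["shards"]
    nodes = stats["nodes"]
    heap = nodes["jvm"]["mem"]
    fs = nodes["fs"]
    return {
        # Only data nodes hold shards, so divide by the data node count.
        "shards_per_data_node": shards["total"] / nodes["count"]["data"],
        "replica_shards": shards["total"] - shards["primaries"],
        # Cluster-wide totals (all nodes), expressed as percentages.
        "heap_used_pct": 100 * heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"],
        "disk_available_pct": 100 * fs["available_in_bytes"] / fs["total_in_bytes"],
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these numbers the two data nodes hold about 519 shards each, only 44 of the 1038 shards are replicas, and roughly 7.5% of the total disk is still available.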

Additional information about our cluster:

  • Data node IOPS (write: 150000, read: 150000)
  • The size of each indexed document is 7KB
  • 2500 documents are indexed every second
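
As a quick sanity check, the document size and indexing rate above can be turned into a daily volume. This is a rough sketch that assumes the rate is a steady average over the whole day and uses decimal units (1 TB = 10^9 KB); it describes raw document volume, not the on-disk index size.

```python
DOC_SIZE_KB = 7          # document size, from the post
DOCS_PER_SECOND = 2500   # indexing rate, from the post
SECONDS_PER_DAY = 24 * 60 * 60

kb_per_second = DOC_SIZE_KB * DOCS_PER_SECOND
mb_per_second = kb_per_second / 1000                     # decimal units (assumption)
tb_per_day = kb_per_second * SECONDS_PER_DAY / 1000**3   # KB -> TB

print(f"{mb_per_second} MB/s, about {tb_per_day:.2f} TB/day")
```

That works out to 17.5 MB/s, or roughly 1.5 TB of raw documents per day at a constant rate.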

This picture shows the device IOPS:

Our cluster hardware:

  • 2 Data nodes (Each 48GB RAM, 10TB SAN)
  • 3 Master nodes (Each 2GB RAM, 10GB HDD)
  • 3 Logstash (Each 12GB RAM, 20GB HDD)
  • 1 Coordinate node (12GB RAM, 100GB HDD)
  • 1 Kibana (12GB RAM, 30GB HDD)

We want to keep our documents for 60 days (= 60TB of data) and delete them after 60 days.

Is the structure below suitable for our requirements?

  • 3 Hot nodes (Each 100GB RAM, 3TB SSD)
  • 4 Warm nodes (Each 100GB RAM, 10TB SAN)
  • 3 Master nodes (Each 48GB RAM)
  • 3 Coordinate nodes (Each 48GB RAM)
  • 3 Logstash (Each 24 GB RAM)
  • 1 Kibana (24 GB RAM)
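
One way to check the proposal is to add up the disk on the data tiers and compare it with the 60TB retention target. This is a minimal sketch: the 15% free-space reserve is an assumed figure for illustration, and replicas are not counted at all.

```python
# Proposed data tiers from the list above: (node count, disk per node in TB).
proposed = {"hot": (3, 3), "warm": (4, 10)}
RETENTION_TB = 60    # the 60-day estimate stated above
HEADROOM_PCT = 15    # assumed free-space reserve, not a figure from the list

raw_tb = sum(count * disk_tb for count, disk_tb in proposed.values())
usable_tb = raw_tb * (100 - HEADROOM_PCT) / 100

print(f"raw: {raw_tb} TB, usable: {usable_tb:.2f} TB, target: {RETENTION_TB} TB")
print("fits" if usable_tb >= RETENTION_TB else "does not fit")
```

The hot and warm nodes together provide 49TB of raw disk, which is already below the 60TB target before any reserve or replica is taken into account.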

Kind regards.

Can you upgrade your JDK and Elasticsearch?
What does GC look like for your cluster?

In an earlier post you said that you have 2TB of data being indexed per day and now 60 days of data corresponds to 60TB. That does not add up to me.

If we make the assumption that the data volume stays the same once indexed (will depend on mappings and index settings) and that you want to have a replica shard enabled for high availability, 1TB of data per day with 60 days retention gives 120TB of storage needed. On top of that Elasticsearch probably needs about 15% headroom due to watermarks, which increases the size to almost 140TB.
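
The estimate above can be written out as a small calculation. This is a sketch under the same assumptions (indexed size equals raw size, one replica); the function name is hypothetical, and reading the headroom as 15% added on top of the replicated volume is an interpretation of "almost 140TB".

```python
def required_storage_tb(daily_tb, retention_days, replicas=1, headroom_pct=15):
    """Estimate disk needed, assuming indexed size equals raw size."""
    primary_tb = daily_tb * retention_days
    with_replicas_tb = primary_tb * (1 + replicas)
    # Add the headroom on top of the replicated volume.
    return with_replicas_tb * (100 + headroom_pct) / 100


print(required_storage_tb(1, 60))   # 1TB/day, 60 days, one replica
print(required_storage_tb(2, 60))   # same, at 2TB/day
```

At 1TB per day this gives 138TB. If the daily volume really is 2TB, the same formula doubles the requirement to 276TB.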

The amount of data a node can handle depends on heap space as well as storage performance and search latency requirements. The volumes you have specified seem fine, but I would recommend you run a benchmark to verify this.

Based on this simple calculation I believe you will need additional data nodes and/or storage in the cluster to meet your requirements.

@warkolm
I upgraded from Oracle jdk-11.0.02 to Oracle jdk-11.0.10.
Why should I upgrade Elasticsearch?
Our Elasticsearch version is 7.9.1.
I don't understand what you mean by "What does GC look like for your cluster?"

@Christian_Dahlqvist
Yes, you are right. Nowadays we have at least 1TB per day, and it grows up to 1.5-2TB.
We want to keep our logs for 60 days, and in the end we have 60TB of disk.
I know that adding some data nodes may work, but I want best practices for an ELK cluster.
If you have any recommendations for our cluster, please tell me.

I have the same issue. I want to scale up my Elastic cluster under the same conditions. Please give us some recommendations.
Thank you.
@Christian_Dahlqvist @warkolm


I would recommend having a look at these:

There are quite a few other blog posts and webinars out there as well around related topics.
