Hardware Recommendation

Hi All,

We have built a few sample reports through Kibana to understand the stack, and we are about to use the Elastic Stack in production.
We would like to hear your suggestions on hardware for the implementation. Here are my requirements:
1. Daily log volume: 20 GB.
2. Data retention period: 3 years, roughly 25 TB of data.
3. Do we need to account for extra storage when logs are stored in Elasticsearch? (For example, when we fed a 2 MB file through the Logstash input, we found roughly 5 MB of storage used in Elasticsearch with the default template in place.)

Please also suggest whether we could go for any Hadoop storage.
Please suggest an Elasticsearch cluster setup for good performance.

Aside from "it depends" (e.g. you didn't include any information on what your query patterns will look like), you might find the following video and the docs helpful:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

Regarding Hadoop storage: https://www.elastic.co/products/hadoop gives you a two-way Hadoop/Elasticsearch connector. Not sure if this is what you are looking for.

Hope this helps,
Isabel

Hi mainec,
Thanks for your reply.
We just wanted to get a basic idea of the following:
What would be the ideal cluster configuration (number of nodes, CPU, RAM, disk size for each node, etc.) for storing the above-mentioned volume of data in Elasticsearch?

I would like to join in on this question.
We are also evaluating the stack for log management.
We have roughly the same requirements as Mohana01 mentioned, except for the data retention period.

Is there a good starting point? Any rough recommendation on hardware to start with a stable but not oversized system?

For log analysis purposes, I would recommend the hot-warm architecture described at https://www.elastic.co/blog/hot-warm-architecture.

You can keep the most recent logs (usually the last 2 weeks to 1 month) on hot nodes. 2x data nodes are enough in your case: 20 GB/day * 30 days = 600 GB. If 20 GB/day is the size of your raw logs, the stored size in Elasticsearch may be smaller or larger, depending on your use case.
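If you want to see how your raw volume actually translates into on-disk size once indexed, the cat indices API reports document counts and store size per index. A minimal check, assuming the default Logstash index naming:

GET _cat/indices/logstash-*?v&h=index,docs.count,pri.store.size,store.size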

For logs older than 30 days, you can use Curator to move the indices to warm nodes. Usually, we don't search those logs a lot.
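As a rough sketch of what that move amounts to (assuming you tag your nodes with a custom box_type attribute as in the hot-warm blog post; the attribute value and index name below are just examples), the index's allocation filter is updated so its shards relocate to the warm tier:

# elasticsearch.yml on hot nodes
node.box_type: hot

# elasticsearch.yml on warm nodes
node.box_type: warm

# relocate an older index to the warm nodes
PUT logstash-2016.09.01/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}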

For logs older than, say, 90 days, you can close the indices to save resources and reopen them only when needed.
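Closing and reopening are just two API calls; something along these lines (the index pattern is only an example):

# free the resources held by indices older than ~90 days
POST logstash-2016.07.*/_close

# bring them back temporarily when someone needs to search them
POST logstash-2016.07.*/_open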

For hot nodes, I would start with 2x servers, each with 64 GB RAM, 2x 4- to 6-core Intel Xeon CPUs, and 1 TB of SSD.
For warm nodes, I would start with 2x servers, each with 64 GB RAM, 2x 4- to 6-core Intel Xeon CPUs, and around 30 TB of 7200 RPM HDD.

Any logs that are searched frequently should stay on hot nodes.

You should have dedicated master nodes, and perhaps client nodes, starting at 4 to 8 GB of RAM each.
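For reference, on the 2.x line the roles are set per node in elasticsearch.yml; a minimal sketch of a dedicated master node and a client (coordinating-only) node:

# dedicated master node: master-eligible, holds no data
node.master: true
node.data: false

# client node: not master-eligible, holds no data, only routes requests and aggregates results
node.master: false
node.data: false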

Currently I'm using the hot-warm model with a scale-up approach (instead of scale-out) to save costs, and the clusters still work fine. Some numbers:

  • 6 to 8 TB (about 10 billion docs) available for searching, with about 1 to 1.5 TB on hot nodes
  • 18 TB of closed indices on warm nodes to meet log retention requirements
  • 2x big servers, each with 2x 12-core Intel Xeon, 256 GB RAM, 2 TB SSD, and 20+ TB HDD
  • 1x normal server to run master nodes
  • Each big server hosts multiple Elasticsearch node types (data, client, master), each with a max heap of 30 GB

The concern with scaling up is that if one big server goes down during peak hours, you may run into performance issues. The configuration is also more complicated.


Hi Anhlqn,

Thanks for the response and suggestions.
I would like to ask about one case of mine: when I index a doc of 2 MB, it ends up stored in Elasticsearch as 5 MB with the dynamic mapping template.
Does the hardware sizing you are using already account for this scenario, or how do you cover such a scenario?

For our logs, the average size of a doc is 500KB to 1MB, but most of the time, the size in ES is smaller than the raw size. That could be because of our mappings.

I believe that for logs, only about 30% of the fields are used for full-text search or aggregations; the rest should be set to either "index": "not_analyzed" or "index": "no".

Before indexing a new log type in ES, I pass the logs through Logstash and review the fields to decide which ones should be indexed. Below is our default mapping for logs:

"mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "mapping": {
              "index": "not_analyzed",
              "omit_norms": true,
              "type": "string"
            },
            "match_mapping_type": "string",
            "match": "*"
          }
        }
      ],
      "include_in_all": false,
      "properties": {
        "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "full_text_search_and_aggregation": {
            "include_in_all": true,
            "type": "string",
            "index": "not_analyzed"
          },
          "full_text_search_by_field_name": {
            "type": "string",
            "index": "analyzed"
          },
          "no_search_or_aggregation": {
            "type": "string",
            "index": "no"
          }
      }
    }

For user convenience, I include the fields that need full-text search in the _all field so that users can search without entering the field name. "include_in_all": false can be changed at any time, which is not the case for how a field is indexed.

I've seen cases where an index is 3x larger than it should be due to unnecessary mappings (using NGram and Edge NGram). The index that holds the tokens is 2x larger than the logs themselves, which requires a lot of resources and is very slow.

You can start with 2 servers, each with:

  • 96 GB RAM
  • 2x CPUs with 6 or more cores
  • Spinning disks with the capacity you need

On each server, you can run:

  • 1 data node with a 30 GB heap
  • 1 master node with a 4 GB heap
  • 1 client node with a 4 GB heap

You also need another standard server, maybe with 8 GB of RAM, to run the 3rd master node (giving you 3 dedicated master nodes in the cluster).

  • If you have problems with disk I/O, follow the SSD model in my previous post.
  • If you want to scale out, just add more servers with 64 GB RAM each to run more data nodes.
  • If you want to scale up, add more RAM to the 2 servers and run more data nodes on them (multiple Elasticsearch instances per physical server; see the sketch after this list).
  • Use Marvel to watch cluster resource usage, and increase the heap size for master and client nodes or move them to dedicated servers if needed.
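As a sketch of the scale-up variant (node names and data paths below are placeholders), each Elasticsearch instance on a physical server gets its own heap, name, and data path; on the 2.x line that could look like:

# first data node on the server
ES_HEAP_SIZE=30g bin/elasticsearch -Des.node.name=data-1 -Des.path.data=/data/es-data-1

# second data node on the same server, added when scaling up
ES_HEAP_SIZE=30g bin/elasticsearch -Des.node.name=data-2 -Des.path.data=/data/es-data-2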

I believe a combination of scaling out and scaling up is good for performance, high availability, and cost effectiveness.

For the specified use case, with a reasonably low indexing volume (20 GB/day) and a long retention period, I think going for a hot/warm architecture is overkill, unless very high query volumes are expected. If data is not being migrated over and volumes are expected to grow over time up to the 3-year retention point, I would start with 3 nodes that are master-eligible and hold data.

Both indexing and querying can use a lot of RAM as well as CPU, so I would go with machines with 64 GB RAM, 6-8 CPU cores, and 6-8 TB of locally attached spinning disk. This may or may not be able to hold the full data set once you get closer to the full retention period, but as you gain experience with the platform you will be able to optimize your mappings to make the best use of your disk space.
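A minimal sketch of that three-node layout (the cluster name and host names are placeholders): every node is both master-eligible and a data node, and minimum_master_nodes is set to 2 so a network split cannot elect two masters:

# elasticsearch.yml, the same on all three nodes
cluster.name: logs
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]
discovery.zen.minimum_master_nodes: 2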


Do you have a recommendation for when to have dedicated master nodes? For instance, if I start with 3 nodes running both the master and data roles, when should I add master-only nodes:

  • at what indexing rates?
  • or the size of data in the cluster?
  • or the number of documents in the cluster?
  • or the number of queries per second?

Thanks,

I think it is impossible to specify that in terms of data volume, indexing rates, or query rates, as this will greatly depend on the hardware used. The properties you want for a master-eligible node are that it has constant access to system resources in terms of CPU and RAM and does not suffer from long GC pauses, which can force a master election. For many small clusters with limited indexing and querying, this is fulfilled by the nodes holding data, so they can often also act as master-eligible nodes, especially when you have a relatively long retention period and data turnover is therefore low.

Once the size of your cluster grows beyond 3-5 nodes, or you start to push your nodes hard through indexing and/or querying, it generally makes sense to introduce dedicated master nodes in order to ensure optimal cluster stability. There is, however, no clearly defined point or rule here; I have seen larger clusters without dedicated master nodes work fine, as well as very small clusters that were pushed very hard benefit greatly from dedicated master nodes.

Based on posts in this forum, I get the feeling that it is quite common for new users to set up dedicated master and data nodes earlier than necessary just because they can. For smaller deployments I generally recommend starting off with 3 master-eligible nodes that also hold data. Depending on the host size, this setup can stretch quite far and is all that a lot of users will ever need.

Thanks for the advice. One of my clusters has the following specs:

  • 4 nodes (4 data and 3 master-eligible), each with 30 GB of heap, running on servers with 64 GB of RAM and 2x Intel Xeon X5650 @ 2.67 GHz
  • Elasticsearch 2.4.x on Windows Server 2012
  • Indexing rate of 2000/s across all 4 nodes, with indexing latency of 4-10 ms
  • Search rate is pretty low, around 3/s
  • 2 TB of data with about 4 billion docs
  • Heap usage on all nodes is constantly at 75% to 90%. Restarting a node lowers heap usage, but not for long.

Is there a need to add dedicated master nodes in this scenario? Would it be more memory efficient to run this cluster on Linux rather than Windows?

Thanks,

I would start looking into why heap usage is so high, as that seems to be the limit you are about to hit. What is the use case? You may however want to start a separate thread for that discussion.
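A quick way to keep an eye on that is the cat nodes API, which shows heap usage and roles per node, plus the node stats API for a detailed JVM breakdown:

GET _cat/nodes?v&h=name,node.role,master,heap.percent,heap.max,ram.percent
GET _nodes/stats/jvm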

