AWS EC2-based cluster best practices

Hello,
We are currently running a 20-data-node cluster on AWS EC2 instances.
Since we have only 1 replica per shard and cannot tolerate the failure of more than 1 node at a time, we are not using ephemeral storage but gp2 EBS volumes for the nodes.

With AWS Savings Plans, we have noticed that EBS is the most significant part of the overall cluster cost, so we are considering changing the volume type (as we cannot decrease the cluster size).

Does anyone have experience with clusters running on ST1/SC1 EBS volumes who can share some insights?
I did see this article (https://logz.io/blog/benchmarking-elasticsearch-magnetic-ebs/), which says that in general SSDs were better for writes, while HDDs actually gave better results for reads.
But since that article is fairly old now, I would like to hear more about the change we are considering.

Thanks in advance,
Lior

It will depend on your use case, but this video shows a good comparison.

Thanks @Christian_Dahlqvist,
Actually, our use case is log data. I'll watch the video and update.

Hey @Christian_Dahlqvist,
So I have watched the video, and as I suspected, write performance is the most relevant consideration for us.
I saw that the indexing rate decreases significantly with HDD.
Is there a way I can monitor my current ingest rate (events per second per node) in order to decide whether the change is feasible?

I'm currently using Telegraf & InfluxDB to monitor the ELK stack.

Thanks,
Lior

Since it is the disks you are considering downgrading, I would recommend monitoring disk utilization and IOPS for your current disks. You can also set up a temporary cluster with the configuration you are considering switching to and feed the two clusters in parallel for a period of time to see how they perform. I would expect anything slower than gp2 EBS to be too slow for nodes that perform indexing, but it may be possible to use it for e.g. cold nodes that store data that is rarely queried and where longer search latencies might be acceptable.
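
For the ingest-rate question: the nodes stats API (which Telegraf's Elasticsearch input can already poll) exposes a per-node indexing counter, so sampling `indices.indexing.index_total` at a fixed interval and diffing gives events per second per node, and the same call returns the `fs.io_stats` disk counters. A minimal sketch of such a request (the `filter_path` here is just one way to trim the response):

    # per-node indexing counter and disk I/O stats (io_stats is Linux only)
    GET _nodes/stats/indices,fs?filter_path=nodes.*.name,nodes.*.indices.indexing.index_total,nodes.*.fs.io_stats.total

On the EC2 hosts themselves, `iostat -xm 60` (or the CloudWatch VolumeReadOps/VolumeWriteOps metrics for the EBS volumes) shows the per-device utilization and IOPS I am suggesting you watch.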

As you have multiple nodes, ideally spread across 3 AZs, you could consider instances with ephemeral disks for the hot nodes.

We are successfully using a hot-warm architecture with ST1 volumes.

Filebeat -> Kafka -> LS -> ES Clients -> ES Hot Data Nodes.

Our daily volume is around 5 TB, with about 5-6 million messages per minute. You need optimized Logstash configurations along with a good number of ES client nodes to keep connections open to the ES data nodes.

We have been running this setup since 7.4.x without issues; with the latest 7.7, thanks to the heap improvements, the nodes have plenty of headroom for indexing at our large volume.

Yes, our motive was the same: reduce the cost.

We are currently experimenting with SC1 volumes for our UltraWarm tier, where in the cold phase of ILM we roll indices over to these nodes but do not actually freeze them, keeping them searchable; with async search this is quite feasible.

You might also want to adjust the node and shard caches to get optimal search performance.
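
To illustrate, a rough sketch of the kind of ILM policy we mean; the policy name, the `box_type` value for the SC1-backed nodes, and the sizes/timings below are placeholders, and the cold phase only reallocates the index (no freeze action) so it stays searchable:

    # illustrative only: policy name, box_type value and timings are assumptions
    PUT _ilm/policy/logs-sc1-sketch
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_size": "100gb", "max_age": "1d" }
            }
          },
          "cold": {
            "min_age": "7d",
            "actions": {
              "allocate": {
                "require": { "box_type": "ultrawarm-sc1" }
              }
            }
          }
        }
      }
    }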

Hey @Sc7565,

What did you mean exactly by ES clients? Did you mean coordinating nodes?
I ran the experiment by adding a new member to the cluster with the same node type as the others but with an ST1 EBS volume, and from the monitoring it seems to perform poorly compared to the gp2 volumes: the new node's IOPS were consistently near 100% utilization and its CPU usage was almost 100% most of the time.

I'm now considering moving to instances with ephemeral storage to save the volume cost; by purchasing Savings Plans for those instances we can reach a fair amount of cost savings.

Yes, ES Coordinating nodes.

Also, we noticed that with the 7.7.1 release the OS CPU usage is always around 100%, whether we use gp2 or st1 volumes, but that was not the case with 7.4.0.

We are checking whether ES 7.7.1 introduced a bug; as 7.8.0 is out, we will either use that or downgrade to 7.4.0 to prove it.

Which version did you test your setup against, and what was the size of the ST1 volume? We are using 5 TB as it gives us good enough throughput.

Hey @Sc7565,
First of all, thank you for your willingness to help.
Our cluster is on ELK 7.5.2, composed of 3 dedicated masters, 2 coordinating nodes (which also run Kibana and Grafana), and 20 data nodes; the current instance type is r5a.2xlarge with gp2 EBS.
As I described earlier, I attached another data node of the same type, but with ST1 volumes (one of 2 TB and one of 1.5 TB), so that all nodes have an equal data partition.
As you experienced yourself, the CPU usage is too high and affects cluster performance.

It is very strange, because the IOPS monitoring also shows that the ST1 node can't handle the pressure compared to the gp2 instances, so I went on to try the next option: ephemeral storage.

Can you share how you measure your daily volume? I believe ours might be a bit lower, so I find it hard to understand how you manage to run on ST1 volumes.

Lior

Hey @Christian_Dahlqvist,
As I'm now looking into ephemeral storage, I have some questions:

  1. Why is it important to spread the nodes across AZs? It can add network latency as the cluster performs synchronization tasks.
  2. Are there any best practices for working with ephemeral storage in Elasticsearch clusters?
    As we currently have only 1 replica per shard and cannot afford to increase the cluster size, a failure of 2 nodes with ephemeral storage means data loss.
  3. Currently we are running 20 data nodes, each configured with a 3.5 TB data partition.
    What is the recommended disk-to-RAM ratio for a 64 GB instance (31 GB heap)? Is increasing to 4 TB possible, or could it affect performance?

Thanks,
Lior

For resiliency. If your cluster is spread across three AZs it will still be operational even if a complete AZ goes down (which has happened).

If you want to be able to handle N nodes going down at the same time you do indeed need N+1 copies of the data. Using ephemeral storage can often give better performance at a better price, but whether the redundancy you have in the form of the cluster is sufficient is up to you to decide.
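
In Elasticsearch terms, N+1 copies means setting `index.number_of_replicas` to N; for example, tolerating 2 simultaneous node failures would look like this (the index name is just a placeholder):

    PUT my-logs-index/_settings
    {
      "index.number_of_replicas": 2
    }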

This depends a lot on how your data and load affect heap usage. Upgrading to the latest version of Elasticsearch can also help, as there have been some quite significant reductions in heap usage lately.

Hi @Lior_Yakobov,

3 dedicated masters; 3 primary shards with 1 replica, each shard around 33.33 GB.

Filebeat --> Kafka --> Production LS (15 Pods) --> ES Clients (4 load balanced) --> ES Hot nodes (6 Nodes)

In the index templates:

    "index": {
      "lifecycle": {
        "name": "production-ilm",
        "rollover_alias": "production"
      },
      "routing": {
        "allocation": {
          "require": {
            "box_type": "hot-app-logs"
          },
          "total_shards_per_node": "1"
        }
      },
      "mapping": {
        "total_fields": {
          "limit": "1000"
        }
      },
      "refresh_interval": "30s",
      "number_of_shards": "3",
      "translog": {
        "durability": "async"
      },
      "soft_deletes": {
        "enabled": "true"
      },
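
The `box_type` requirement in that template assumes each hot data node declares a matching custom attribute in its elasticsearch.yaml, along these lines:

    # node attribute that the template's allocation filter matches against
    node.attr.box_type: hot-app-logs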

Hot Nodes -

Mixed-instance ASG (16 CPU / 128 GB memory); each pod gets 14 CPU / 114 GB. Heap is 30 GB.
elasticsearch.yaml (the settings below help us ingest more):

indices.memory.index_buffer_size: 30%
indices.memory.min_index_buffer_size: 96mb

ILM policies are used to roll data over onto the warm nodes.

Warm nodes -
Mixed-instance ASG (16 CPU / 64 GB memory); each pod gets 14 CPU / 57 GB. Heap is 30 GB.
elasticsearch.yaml (the settings below allow more caching for better search performance):

    indices.queries.cache.size: 40%
    indices.requests.cache.size: 25%

FYI: high OS CPU usage on 7.7.1

We did verify something odd about ES releases newer than 7.5 that leads to high (almost 100%) OS CPU usage, while the same configuration on 7.4.0 behaves normally.

We also converted everything to a single AZ now to save money; since everything is on EBS volumes, we trust AWS not to clobber that.

Hope this helps.

We sometimes ingest more than 6 TB of data a day; there is some lag, but only for a few hours.

How would the cluster stay operational after a complete AZ failure?
If I have 20 data nodes spread over 3 AZs, that means around 7 nodes per AZ, so if an AZ fails I would need 7 replicas in the cluster to stay operational, is that correct?

For cost reduction purposes, we are planning to stay with one replica only, meaning we can tolerate a single node failure.

Thank you for the recommendation; I will work on upgrading as well in the next couple of weeks.

It seems from your configuration that your data retention is much lower than ours; if I set such big cache sizes, our cluster would probably not operate well.
Actually, most of our data in ELK (I believe around 70% of it) is not frequently accessed, but on the other hand, when we tried the hot/warm configuration the cluster had very bad performance and high load due to the ongoing shard movement.

Could it be that with ephemeral storage the performance and load average would be better?
Also, what is the most significant advantage of warm nodes?
Should I use the same specs with more storage, or can warm nodes have fewer resources since their data is not frequently accessed?

Thanks,
Lior

No, that would be very inefficient. The manual describes how to achieve zone-level resilience with a single replica by using allocation awareness.
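
To sketch what that looks like (assuming the attribute is named `zone` and the values are the AZ names), each node is tagged with its zone and the cluster is told to be zone-aware, so copies of the same shard are kept in different AZs:

    # elasticsearch.yaml on each node: tag it with the AZ it runs in
    node.attr.zone: eu-west-1a

    # enable zone-aware shard allocation (can also be set via the cluster settings API)
    cluster.routing.allocation.awareness.attributes: zone

With that in place, losing an entire AZ costs at most one copy of each shard, so a single replica is enough for the cluster to stay operational, capacity permitting.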

Thank you David, I will take it into consideration.
I do have a question, though, about the hot/warm architecture.
Should warm nodes have the same specs as hot nodes but with more storage, or can warm nodes have fewer resources since their data is not frequently accessed?

Thanks,
Lior

It totally depends on how you intend to use them.

Hey @DavidTurner,
Actually, our usage is pretty basic: we store application logs and query them with Kibana and Grafana.
No excessive API usage, but some heavy queries and aggregations.
I know that sizing depends on how the cluster is used, but is there a rough estimate for general use cases?
As I said, we are running 20 data nodes with max heap (31 GB), a 3.5 TB data partition each, and a total cluster size of 60 TB.
My question is how I can know whether 3.5 TB / 4 TB per node is reasonable, or whether I could even go to 10 TB per node and halve the number of nodes in the cluster.
It is very hard to find a concrete answer to this question online, so any indication about our plans moving forward would be helpful.

Thanks in advance,
Lior
