Increasing indexing performance without sacrificing reliability

Hey guys, I've got a 6.5 PB cluster, composed of 150 d2.8xlarge instances (36 cores @ 2.5 GHz, 244 GB RAM, 24 hard disks (2 TB each, not in RAID)) that act as both data and ingest nodes. I have Filebeat running on 700+ prod service hosts, streaming around 3.5 GB/s of logs into hourly indexes. I've got a few questions regarding how I can optimize the performance of this cluster.

At the moment, I've configured my indexes to have 100 shards and 0 replicas. Every hour, when the traffic starts pouring into a new index, I can see the CPU usage of some servers drop and others climb. This quite clearly shows which 100 servers were chosen for this hour's index, while the rest sit close to idle. Thus, I conclude that a given server only contributes to indexing performance IF it stores a shard of the current index. Is this correct?
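(For reference, here is roughly how I've been checking which nodes hold the current hour's shards; the index name is just a made-up example of our hourly naming.)

```
// Lists each shard of the (hypothetical) hourly index and the node it lives on
GET _cat/shards/logs-2018-10-05-14?v&h=index,shard,prirep,state,node
```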

Secondly, from this page on maximizing index performance, I see that you guys recommend disabling replicas for initial loads of one-off data. That suggests to me that having 2 copies of a shard (1 primary, 1 replica) doubles the indexing load, so replication can't be used to increase indexing throughput (although I understand that it does help with search performance).
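(To make sure I'm reading that advice right, I take it to mean something like the following, with a made-up index name: drop the replicas for the load, then restore them afterwards.)

```
// Drop replicas before a one-off bulk load (dynamic, per-index setting)
PUT myindex-2018-10-05-14/_settings
{
  "index": { "number_of_replicas": 0 }
}

// Restore a replica once the load has finished
PUT myindex-2018-10-05-14/_settings
{
  "index": { "number_of_replicas": 1 }
}
```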

So then, in order to get peak indexing performance out of my 150 nodes, I'll need an index with at least 150 shards. The issue there is that, with no replicas, losing any one host loses its shards and leaves the cluster red. Are there any options? Can all hosts participate in the indexing without each one becoming a risk to the whole cluster?

From what I gather, replica shards work like RAID 1: you pay for 100% extra space, but get instant failover. But there are many RAID schemes (RAID 5/6, for example) that protect against a single drive failure with only a small portion of extra space reserved for parity data. Recovering after a lost drive takes some processing time (to recompute the lost drive's contents from the other drives plus the parity), but in exchange the disk over-provisioning is MUCH lower. Does Elasticsearch have anything similar?

When I get some free time, I'll work on switching to a hot/cold setup, but that poses its own challenges, because force-merging and shrinking such large indexes becomes really difficult. Any advice?
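(To make the difficulty concrete, this is the sort of per-index dance I understand the cold tier would need, with made-up index and node names; the shrink step alone requires co-locating a copy of every shard on a single node.)

```
// 1. Make the index read-only and pull one copy of every shard onto one node
PUT logs-2018-10-05-14/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "cold-node-042"
}

// 2. Shrink the 100 primaries down to a factor of 100, e.g. 25
POST logs-2018-10-05-14/_shrink/logs-2018-10-05-14-shrunk
{
  "settings": {
    "index.number_of_shards": 25,
    "index.routing.allocation.require._name": null
  }
}

// 3. Force-merge the shrunken index down to one segment per shard
POST logs-2018-10-05-14-shrunk/_forcemerge?max_num_segments=1
```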

What is your average shard size? What is the retention period of your data? Are you indexing into a single index at a time? Which version of Elasticsearch are you using? Are you running more than one node per host? Do you have dedicated master nodes?

  1. Average shard size: 10-50 GB (varies depending on the time of day).
  2. Retention period: As long as we can given the 6.5 PB limit, which is like a month.
  3. Are you indexing into a single index at a time? Yes, except for brief overlap during the rollover every hour.
  4. Which version of Elasticsearch are you using? 6.4.1
  5. Are you running more than one node per host? Do you mean more than 1 ES instance per server? Nope, just 1:1
  6. Do you have dedicated master nodes? Yes, 3 c5.4xlarge servers.

How much data do you have in the cluster at the moment? What heap size are you using?

  • How much data do you have in the cluster at the moment? About 1.5 PB

  • What heap size are you using? Half of the available 244 GB per d2.8xlarge server (144 GB per node).

What is the heap pressure looking like on the nodes with this volume of data?

Yes, that is correct. Given that you have set up your disks as multiple data paths, as far as I understand, you will be working a single disk per node quite hard all the time, since each shard lives entirely on one data path. Once you have more data in the cluster and that disk also needs to serve more queries, this could potentially become a bottleneck. As these nodes have ephemeral disks and data will be lost on node failure, it is a bit risky to index without any replicas. Doing so does increase indexing performance and takes up less space, but it is considerably less reliable.

If you are looking for reliability and want to avoid data loss and reprocessing of data, I would recommend enabling one replica per shard, so you can lose a node without losing all copies of a shard.

You could have 75 primary shards and 1 replica, which gives 150 shard copies in total, enough for every node to hold one.
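A minimal sketch of that in an index template (the template name and pattern below are just placeholders for whatever your hourly indices match):

```
PUT _template/hourly-logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.number_of_shards": 75,
    "index.number_of_replicas": 1
  }
}
```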

Each node owns its shards and requires separate storage for them. It is not possible to rely on the redundancy of a shared file system and have multiple nodes access the same data.

In order to reach the high node densities you will need to hit 6.5 PB of capacity, you will need to optimise your storage as outlined in this webinar. This is more easily achieved with a hot-warm architecture.
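As a rough sketch of the allocation side of hot-warm (the attribute name, attribute values and index name below are illustrative): each node advertises a custom attribute in elasticsearch.yml, e.g. `node.attr.box_type: hot` or `node.attr.box_type: warm`, and each index is pinned to a tier via shard allocation filtering.

```
// Keep a (hypothetical) hourly index on the hot nodes while it is being written to
PUT logs-2018-10-05-14/_settings
{
  "index.routing.allocation.require.box_type": "hot"
}

// Once writes stop, relocate it to the warm tier
PUT logs-2018-10-05-14/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```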

If I calculate correctly, generating 100 shards per hour gives around 72,000 shards for a month of retention, even before considering replicas. That is, in my experience, far too much for a single cluster, so I would recommend splitting this into multiple smaller clusters and using cross-cluster search to query across them. In order to hold as much data as possible, you also want to optimise the shard size. You can do this by using the rollover API and rolling over to new indices based on size rather than a fixed time interval. This allows you to keep all shards at a target size of e.g. 50 GB, which will reduce the number of shards you hold in the cluster quite significantly. You may even go beyond 50 GB as you have good networking in place.
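For illustration, size-based rollover looks roughly like this (alias, index names and thresholds are placeholders; note that `max_size` is the total size of the index's primary shards, so multiply your per-shard target by the primary count):

```
// Bootstrap the first index behind a write alias
PUT logs-000001
{
  "aliases": {
    "logs-write": {}
  }
}

// Index via the alias, and call this periodically; with 75 primaries and a
// ~50 GB per-shard target, max_size would be around 3750 GB
POST logs-write/_rollover
{
  "conditions": {
    "max_size": "3750gb",
    "max_age": "7d"
  }
}
```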


Yeah, I was thinking about this, but I don't have a finer-grained way to subdivide my production servers to make them ship their logs to different endpoints. I've already segregated my production servers by realm: this 6.5 PB cluster is just for NA alone, and EU and the rest of the world have their own separate ES clusters.

This is a good idea. I'll investigate it.

Yeah, I've got 10 Gbps networking. How much larger do you think I can go?

Are the underlying Lucene segments of a shard spread across the available data paths? That would contribute to I/O parallelism, which would be quite useful.

I would probably start somewhere in the 50 GB to 100 GB range and see how that works. As a single shard can contain a maximum of 2 billion documents, this may limit how far you can go.

Each shard, including all its segments, is always located on a single data path.
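For reference, the data paths are just listed in each node's elasticsearch.yml (the mount points below are made up); Elasticsearch spreads whole shards across them but never splits a single shard:

```
# elasticsearch.yml (sketch): one entry per physical disk on a d2.8xlarge
path.data:
  - /mnt/disk01/elasticsearch
  - /mnt/disk02/elasticsearch
  # ...
  - /mnt/disk24/elasticsearch
```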
