Best practices for a “Hot-Warm” architecture in Elasticsearch

Hi,
I'm using a Hot-Warm Elasticsearch setup to store and search logs for my company.
My ES cluster has 8 hot nodes and 10 warm nodes located on 10 servers.
Each server has 64GB RAM.
Two servers run only a warm node each and work fine. On these two I use the recommended setting of 32GB RAM for the ES heap and leave the rest free.
The other 8 servers have both SSD and HDD disks, so on each of them I run 1 hot node and 1 warm node.
I give 20GB of heap to the hot node and 24GB to the warm node, leaving just 20GB of RAM free.
Every day, ES creates 12 indices, one for every 2 hours. Each index has 8 shards located on the 8 hot nodes.
My daily Curator actions are as follows (the equivalent index-level calls are sketched after the list):

  1. Move all indices older than 2 days to warm.
  2. Close all indices older than 30 days.
  3. Delete all indices older than 50 days.
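
For reference, those three daily actions boil down to roughly these index-level calls (a sketch with placeholder index names; Curator runs the equivalent operation against every matching index):

    # 1. move an index older than 2 days to the warm tier
    PUT /logs-2017.10.01-00/_settings
    { "index.routing.allocation.require.box_type": "warm" }

    # 2. close an index older than 30 days
    POST /logs-2017.09.01-00/_close

    # 3. delete an index older than 50 days
    DELETE /logs-2017.08.12-00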
Data information:
  Each index is about 250-320GB in size.
  The total size of all open indices (30 days' worth) is about 100TB.

Here are some problems I have encountered:

  1. Slow search performance.
  2. Heap usage on the warm nodes is always red (>90% used).

I have some questions:

  1. How can I make the heap on the warm nodes healthy (<90% used)?
  • Add more RAM? But the recommendation is that each machine should have only 64GB of RAM. If adding RAM is the answer, how much should I add?
  • Force merge segments? I notice that when a node has ~10,000 segments it uses 15-16GB of heap, but when I force merge to a maximum of 1 segment per shard across the cluster, each node drops to ~2,000 segments and uses only 13-14GB. However, I also read that the maximum segment size in ES 5.0/Lucene 6 is 5GB, and if I force merge to 1 segment per shard, each segment would be 30-40GB. So should I force merge down to one segment? And is it true that fewer segments use less memory?
  2. How can I improve search performance? Are there any tricks with doc_type, mappings, segments, etc. to improve it?

Thanks in advance.


The recommended heap size for Elasticsearch should never cross the compressed-oops threshold, which is typically just north of 30.5G. If you have set yours to 32G, I'd check the logs to make sure you see:

heap size [YOUR_HEAP_SIZE], compressed ordinary object pointers [true]
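
If the startup log has already rotated away, the same information is exposed by the nodes info API (a quick sketch; the field names are taken from the 5.x nodes info output):

    # every node should report "using_compressed_ordinary_object_pointers": "true"
    GET /_nodes/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.using_compressed_ordinary_object_pointers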

Also, giving 44G of a 64GB machine to heap (20G for the hot node plus 24G for the warm node) is not best practice, as the guidance is to preserve 50% of system memory for filesystem caching.

My math has this at a total of 12 * 2 * ( 8 * 2 ) (assuming 1 replica shard per primary shard), which is 384 shards per 2 hour time period, or 4,608 shards per day. Based on your described architecture, and assuming shards to be distributed fairly equally across your nodes, that would be 1,152 shards per hot node, and 13,824 per warm node.

The combination of most of your warm nodes having only 24G heaps plus the high shard count explains both the memory pressure and the slow search. For a typical setup, I would not exceed 20 shards per gigabyte of heap on a hot node, and no more than double that on a warm node. Exceeding that causes memory pressure, as each active shard has a maintenance cost, and the evidence is in the numbers you shared.
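
You can see where each node stands with the cat APIs (a quick sketch; both endpoints are standard in 5.x):

    # shards and disk usage per node
    GET /_cat/allocation?v

    # heap and role per node
    GET /_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent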

Forcemerging can help, but it is not a quick process. It is absolutely okay for a segment to be bigger than 5GB; that 5GB figure is simply the TieredMergePolicy default for internal, automatic merges. A force merge can make a segment as large as Lucene will allow (there is a per-segment maximum document count you cannot exceed). Yes, fewer segments will use less memory, but not enough less to counter the heap usage you have shared.
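
If you do try it, the call itself is simple; just run it only against indices that are no longer being written to (placeholder index name):

    POST /logs-2017.10.01-00/_forcemerge?max_num_segments=1

    # segment counts before/after can be checked with
    GET /_cat/segments/logs-2017.10.01-00?v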

Even though the typical best-practice recommendation is to not let shards grow past 50G, you may need to in your use case. It's not impossible to have 100G or even 200G shards. The trade-off is that large shards take longer to migrate or recover.

Slowness is still better than a cluster potentially falling over because it is too full of shards. Fix the high shard count first, and the rest will correct itself (to an extent) as a direct consequence.


By the way, heap usage constantly above 90% on the warm nodes is a huge problem. The Elasticsearch JVM will be trying to run garbage collection constantly, one attempt after another, because the JVM is configured to trigger garbage collection at 75% heap utilization. This also leads to slow search, slow indexing, slow everything. This is a tremendous amount of memory pressure, and the only way to correct it is to either close a lot more indices or add more nodes to spread out the shard count. Even adding more RAM to permit a 30G heap will only help a tiny bit.
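
For reference, that 75% trigger comes from the CMS garbage-collector flags shipped in the default 5.x jvm.options (shown here as the stock values, not something to change lightly):

    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly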


I agree with Aaron's recommendations: you have far too many shards and need to reduce the count dramatically. Please read this blog post on shards and sharding practices and then change how you create and manage indices. I would recommend changing to daily indices in order to increase the average shard size, as per the recommendations in the blog post. We are also doing a webinar this Thursday on this topic, which may be useful.

Elasticsearch keeps a lot of data off-heap, which means that the size of the filesystem cache is important for good performance. You will therefore need to adjust the amount of heap you give each node if you are running multiple nodes on a host with just 64GB of RAM. As Elasticsearch by default assumes all nodes are equal, you may want to run 2 nodes, each with a 16GB heap, on all hosts. Another option might be to rearrange your disks and run a single node, hot or warm, per host with a 30.5GB heap.
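
As a sketch of the first option, each of the two nodes on a 64GB host would get something like this in its jvm.options, leaving roughly half the machine for the filesystem cache:

    # per-node heap on a 64GB host running two Elasticsearch nodes
    -Xms16g
    -Xmx16g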

Forcemerging down to a single segment per shard can save a lot of heap and is definitely recommended. It is, however, as Aaron describes, quite an expensive and slow process that uses a lot of disk I/O.


Thanks for the quick answers, @Christian_Dahlqvist and @theuntergeek!

My math has this at a total of 12 * 2 * ( 8 * 2 ) (assuming 1 replica shard per primary shard), which is 384 shards per 2 hour time period, or 4,608 shards per day. Based on your described architecture, and assuming shards to be distributed fairly equally across your nodes, that would be 1,152 shards per hot node, and 13,824 per warm node.

I think I did not express it clearly. It is actually one index every 2 hours, and each index has 8 primary shards => 12 * 8 = 96 shards per day, x2 with replicas = 192 shards per day.
So on the hot tier it is 192 shards * 2 days = 384 shards on 8 hot nodes * 20GB heap = 160GB heap => about 2 shards per GB of heap. The hot nodes are not the problem anyway.
On the warm tier it is 192 shards * 40 days = 7,680 shards on 10 warm nodes * 24GB heap = 240GB heap => 32 shards per GB of heap, which is well above the 20-25 shards per GB recommended in the link you shared.
I tried closing old indices and keeping only 20 days open on warm, and now the warm nodes' heap oscillates between 50% and 80%.



As calculated above, each shard will be around 25-40GB. Should I reduce the number of shards and increase their size? What heap-usage percentage is healthy for a warm node? I really need to store and search more than 22 days of data.

heap size [YOUR_HEAP_SIZE], compressed ordinary object pointers [true]

I can't find this line. Which file should I cat to see it?

We tested indexing on a single hot node before and it reached 15k events per second (EPS).
Now we have 8 hot nodes and it only reaches about 56k EPS on average...

That means only 7k EPS per node. :sweat:
How can I increase indexing performance? I notice there is a setting for the maximum number of merge threads (index.merge.scheduler.max_thread_count): keep the default for SSD and change it to 1 for HDD. Will more merge threads increase indexing EPS? And how should I manage this setting when the hot-warm architecture has both SSD and HDD?
These are my index template settings:

"settings": {
    "index": {
      "routing": {
        "allocation": {
          "require": {
            "box_type": "hot"
          }
        }
      },
      "refresh_interval": "30s",
      "number_of_shards": "8",
      "translog": {
        "flush_threshold_size": "2g"
      },
      "merge": {
        "scheduler": {
          "max_thread_count": "1"
        }
      },
      "query": {
        "default_field": "raw_*"
      },
      "unassigned": {
        "node_left": {
          "delayed_timeout": "5m"
        }
      },
      "number_of_replicas": "0"
    }
  }

If you index into a single node and get 15k EPS, adding a node with 1 replica shard configured will most likely not give you any additional indexing throughput, as the primary and replica shards do the same work. Since data also needs to be transferred between nodes, throughput may actually drop, which is why 7k EPS per node is not surprising if you have one replica configured. You could index without any replicas, but that would increase the risk of data loss if there is a failure of some kind.

The best way to reduce heap usage will, in my opinion, be to start forcemerging your indices down to a single segment per shard. As this is expensive from a performance perspective, you may want to start doing it for new indices only, once they are read-only and before they are moved to the warm nodes. Over time the proportion of force-merged indices on the warm nodes will increase, which should reduce heap pressure. If you have spare capacity you can also forcemerge existing indices already on the warm nodes.
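
A sketch of that per-index routine (placeholder index name; the same steps can also be scripted or driven by Curator):

    # stop writes to the finished index
    PUT /logs-2017.10.01-00/_settings
    { "index.blocks.write": true }

    # merge it down to one segment per shard
    POST /logs-2017.10.01-00/_forcemerge?max_num_segments=1

    # then add the replica and relocate it to the warm tier
    PUT /logs-2017.10.01-00/_settings
    {
      "index.number_of_replicas": 1,
      "index.routing.allocation.require.box_type": "warm"
    }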


Actually, I don't index with any replicas; I only set replicas to 1 after an index finishes indexing and is set to read-only, so the problem must be something else.
Final question: is there an ES course that suits my case? Which course would you recommend?
Thank you so much!


We cover such topics in the Elasticsearch Engineer I & II courses. Check availability for your region on our website: training.elastic.co
