Creating indices and update_mapping tasks take way too long

Hi everyone,

We're experiencing some slowness on our Elasticsearch 6.2.3 clusters when performing update_mapping and create index tasks. One such cluster consists of 3 master nodes (4 cores / 14GB memory / 7GB heap), 3 client nodes (same setup), 40 hot data nodes (8 cores / 64GB memory / 30.5GB heap and 1.4TB local SSD disks) and 8 warm data nodes (8 cores / 32GB memory / 16GB heap and 5TB spinning disks). This cluster contains ~14,000 primary shards (~25,000 active shards in total) spread across ~8,900 indices, and calling GET _cluster/state shows that the cluster state weighs ~94MB. This is the current response to GET _cluster/health:
{
"cluster_name": "tier-1-01",
"status": "green",
"timed_out": false,
"number_of_nodes": 54,
"number_of_data_nodes": 48,
"active_primary_shards": 13722,
"active_shards": 25204,
"relocating_shards": 8,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}

As we're indexing our customers' data, whose (changing) structure we cannot control, we rely heavily on dynamic mapping. Also, to prevent backlogs during the UTC rollover, we open the next day's indices in advance: around 1,200 create index tasks run daily during low-activity hours. Each index matches an index template, which injects our custom analyzers, the dynamic mapping settings, some custom fields and 3 aliases per index (for index rollover, searching etc.); a simplified sketch of such a template is shown below. These indices live for the retention period of each customer (between 1 and 30 days, mostly 1 or 14). We also have a relatively high variation of docs.count / size_in_bytes between indices: some contain no documents at all (230-byte single-shard indices where the user ended up not sending any data that day), while others hold much larger amounts of data (e.g. 1TB spread across 30-40 shards).
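For context, the template we apply looks roughly like this sketch (the template name, analyzer, fields and alias names are simplified placeholders, not our exact production template):

# illustrative daily index template, 6.x-style
PUT _template/customer-daily
{
  "index_patterns": ["customer-*"],
  "settings": {
    "number_of_shards": 2,
    "analysis": {
      "analyzer": {
        "our_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "dynamic": true,
      "properties": {
        "timestamp": { "type": "date" }
      }
    }
  },
  "aliases": {
    "customer-search": {},
    "customer-write": {},
    "customer-rollover": {}
  }
}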

As discussed in GitHub issue #30370, we're experiencing many timeouts on update_mapping when our background tasks run to open the indices for the next day. This graph shows the strong correlation between the two types of tasks:

It seems that create index tasks (taking between 4.5 and 5.5 seconds each), which Elasticsearch assigns 'urgent' priority, push aside update_mapping tasks, which are assigned 'high' priority, greatly increasing the chance that the latter only complete after their 30-second timeout (a way to observe this is shown below). However, as @DavidTurner and @bleskes suggested, this may be caused by some misconfiguration or another issue, which we'd like some help investigating.
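For reference, the backlog and the per-task priorities can be watched while the nightly creation job runs, for example with:

# lists queued cluster state update tasks with their priority and time in queue
GET _cluster/pending_tasks

# same information in tabular form
GET _cat/pending_tasks?v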

Let me know if any further information is needed, and we'll dive deeper accordingly.

Thanks!

How many GB/TB of data is this?

The cluster holds approximately 53TB:
28TB in hot storage (SSD)
25TB in warm storage (HDD)

Index creation and bulk insert operations only occur on the hot nodes.

You appear to be over-sharded. I would look to reduce that count ASAP. You can use _shrink and then also look at changing your daily index shard count.
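A rough sketch of the shrink flow, assuming an existing daily index customer-2018.06.01 with 8 primaries being shrunk down to 1 (index and node names and the shard counts are illustrative):

# 1. Make the index read-only and require a copy of every shard on one node
PUT customer-2018.06.01/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "hot-data-node-1",
    "index.blocks.write": true
  }
}

# 2. Shrink into a single-shard index, clearing the temporary settings on the target
POST customer-2018.06.01/_shrink/customer-2018.06.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}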

How many shards per node is considered "over-sharded"?

Depends.

But you should be aiming at <50GB per shard. Having lots of little shards wastes heap size.
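You can check where you stand with something like:

# per-shard store size, biggest first
GET _cat/shards?v&h=index,shard,prirep,store&s=store:desc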

We are within the 50GB limit. Also, I don't see any issues with the Java heap/GC.
Can you please elaborate on what bottleneck is causing this slowness?
We have a large cluster, so I'm not sure we are over-sharded per data node.

Although your average shard size of around 2GB is rather small, I suspect it may be the significant number of indices (resulting in a large cluster state) that is the problem here. If you are using daily indices, you may be better off using the rollover API to increase the shard size and the time period covered by each index, as this would reduce the number of indices.
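A minimal sketch of the rollover pattern (the alias name and thresholds are just examples): you index through a write alias and periodically call _rollover, which only creates a new index once the current one is big or old enough, instead of unconditionally creating ~1,200 new indices every day.

# bootstrap the first index behind a write alias
PUT customer-000001
{
  "aliases": {
    "customer-write": {}
  }
}

# run periodically; a new index is only created if a condition is met
POST customer-write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_size": "100gb"
  }
}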

As you are hosting a number of customers in the same cluster and they seem to have their own indices, you may also benefit from using multiple smaller clusters instead, as that would lead to smaller cluster states.


@Christian_Dahlqvist Thanks for the response. We are considering both approaches (rollover or cluster split).
If the root cause is the large cluster state, will a stronger master have any positive effect (as a temporary workaround)?

Updates to the cluster state are single-threaded, so more powerful nodes may not help. As you are creating new indices manually ahead of time, you could perhaps create each new index with the full mapping of the previous index rather than starting from a clean index template and relying on dynamic mapping to build it out again.
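A rough sketch of that approach, assuming daily indices like customer-2018.06.01 and the single 6.x mapping type doc: copy the properties object returned by the first call into the second (shown here with two placeholder fields).

# 1. Fetch the mapping that dynamic mapping has built up on today's index
GET customer-2018.06.01/_mapping

# 2. Create tomorrow's index with that mapping included up front, so that
#    already-known fields no longer trigger update_mapping tasks
PUT customer-2018.06.02
{
  "mappings": {
    "doc": {
      "properties": {
        "timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}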

@Christian_Dahlqvist Any chance that we need more memory in the masters? We have 14GB and we allocated 7GB to the Java heap.
Also, is 50% Java heap the best practice for the masters?

As dedicated master nodes do not hold data, you can go higher with heap on these nodes if you need to. 75% to 80% might be a good starting point.
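On a 14GB dedicated master that would mean something like the following in config/jvm.options (roughly 10-11GB; the exact value is a judgment call, just keep Xms and Xmx equal and leave some room for the OS):

# heap settings for a dedicated master with 14GB of RAM (illustrative)
-Xms11g
-Xmx11g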

@Christian_Dahlqvist Thanks, we will change it. Any chance more memory will improve the master performance in our use case?

If you are seeing heap pressure and/or long GC it may, but otherwise it may simply be the size of the cluster state that slows things down.

@Christian_Dahlqvist Here are our master heap stats:


Per our diagnostics it looks fine. Any insights?
Thanks for your help!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.