Performance degraded after upgrading to 7.x

We were running Elasticsearch 6.7.2 and decided to upgrade to 7.x, but our performance has degraded since the upgrade. We use the cluster to ingest log data, all shipped via Filebeat. Index names contain dates, so a new index is created each day.

Since moving to 7.3.0, every day when the new indices are created, at exactly the same time, the cluster goes yellow and sometimes red. We tried increasing our index buffer size, but that didn't seem to help. Yesterday the cluster went red (indexing stopped) and took about a couple of hours to recover back to green and start indexing again.

Occasionally we notice data nodes drop out of the cluster and restart, which causes shard rebalancing.

Our ingestion rate is about 25,000 to 35,000 documents per second on a cluster of 3 master, 6 client, and 90 data nodes, each with an SSD-backed data directory, 60 GB of memory, an 8-core CPU, and the heap size set to 30.5 GB.

Here is what our current config looks like. Any suggestions to help improve index creation performance during heavy ingest rates would be greatly appreciated.

cluster.name: some-cluster-name
path.data: "/some/storage/location/on/ssd"
path.logs: "/var/log/elasticsearch"
node.max_local_storage_nodes: 1
bootstrap.memory_lock: true
gateway.recover_after_nodes: 2
gateway.recover_after_time: 10m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.hosts_provider: ec2
discovery.ec2.host_type: private_ip
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.availability_zones:
- us-east-1a
- us-east-1b
- us-east-1c
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cluster.routing.allocation.cluster_concurrent_rebalance: 50
cluster.routing.allocation.node_concurrent_incoming_recoveries: 5
cluster.routing.allocation.node_concurrent_outgoing_recoveries: 5
indices.recovery.max_bytes_per_sec: 250mb
indices.memory.index_buffer_size: 40%
network.host: 0.0.0.0
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: /some/path/to/ssl/key
xpack.security.transport.ssl.certificate: /some/other/path/to/cert
xpack.security.transport.ssl.certificate_authorities: /another/path/to/the/ca
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.http.ssl.enabled: false
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
xpack.security.authc.anonymous.roles: healthcheck
xpack.security.authc.anonymous.authz_exception: true

Cluster name and paths have been replaced with dummy data.

I was thinking of setting index.translog.durability to async to reduce the fsync on every request, but I read that the performance gain isn't large and that changing it is usually not recommended.
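For reference, that setting could be applied to new indices through an index template roughly like this (the template name, index pattern, and sync interval below are only illustrative). The trade-off is that with async durability, up to one sync interval's worth of acknowledged writes can be lost if a node dies:

```
PUT _template/logs-async-translog
{
  "index_patterns": ["filebeat-*"],
  "settings": {
    "index.translog.durability": "async",
    "index.translog.sync_interval": "30s"
  }
}
```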


How many indices and shards do you have in the cluster? How much space does this data take up on disk?

6k indices, 85k shards (30k primary).

Each data node has 2 TB of SSD storage and is currently only using about 300 GB per node.

Based on that, it sounds like your average shard size is only just over 300 MB, if I have calculated correctly. That is very small and will be quite inefficient, as it leads to a large cluster state that takes time to update. I would recommend reducing the number of shards quite dramatically, as outlined in this blog post. It is often recommended to have an average shard size between 10 GB and 50 GB, although sometimes even larger shards are optimal. If your average shard size were 20 GB, that would leave you with about 1,350 shards in total, which corresponds to just 15 shards per node. This would make cluster state updates faster and most likely also improve query performance.

Data ingestion varies greatly for each index. Some shards are 20 GB in size while others are 5 MB. Filebeat rotates indices based on date. Is there a way to have Filebeat only rotate an index once it's past a certain size instead of doing date-based rotation?

Also, how would ILM work for deleting older data? We have an ILM process in place to delete data older than 45 days, but if an index contains data for multiple days, will it only delete the documents older than 45 days, or wait until the entire index has not been updated for 45 days?

ILM allows you to create new indices based on size or age, so you can target a specific size and have a max age that works with your retention period. This would allow indices with low inflow to cut over once a month or so, while larger indices can cut over much more frequently. It also gives you much more even index and shard sizes if your data volumes fluctuate over time.
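As a sketch, an ILM policy along those lines might look like the following (the policy name and thresholds are placeholders, not a recommendation). Note that the delete phase removes the whole index once enough time has passed since rollover; it does not delete individual documents:

```
PUT _ilm/policy/logs-45d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "45d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```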

I am not sure whether ILM can ensure that all data in an index is older than the retention period (Curator can), but you can work around this by adjusting the retention period based on the maximum time span covered by each index.

If I let ILM do index rotation based on size, wouldn't I need to create the first index manually? Right now we have index auto-creation enabled, so when all the Filebeat instances send in data, the index gets created automatically with the new name and date suffix. I remember reading somewhere that I would need to create the first index manually for ILM index rotation to work properly. This wouldn't be feasible for us due to the large number of auto-generated indices coming from our Kubernetes clusters, based on namespaces.

That is a good point, and I do not think ILM works well with dynamic index creation like that. Why are you creating separate indices per namespace? Do you have different retention periods per namespace? How many different indices are you creating each day?

An alternative is to use the Rollover API with a max_size condition set to an optimal value rather than using per-day indices, with a cron job performing the periodic rollover checks. It works well for us.
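For illustration, a periodic size-based rollover check against a write alias might look like this (the alias name and thresholds are placeholders); the request is a no-op unless one of the conditions is met, so it is safe to run from cron every few minutes:

```
POST /logs-write/_rollover
{
  "conditions": {
    "max_size": "30gb",
    "max_age": "1d"
  }
}
```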

https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-rollover-index.html

But regardless of the changes, my experience is that 7.3.1 seems to be slower than 6.1.1.
I don't do a rolling upgrade; I double-write to two clusters.
So I have an existing 6.1.1 cluster and a new 7.3.1 cluster, and my applications write the same data to both.
The CPU utilization of the new 7.3.1 cluster is higher than the old cluster's, by about 20-30%. I do enable index sorting in the new 7.3.1 cluster. Maybe that's why it uses more CPU?
But querying with aggregations on 7.3.1 is definitely slower compared to 6.1.1 (even with the sorting field).

I was testing on indices with pretty much exactly the same data. Every query on 7.3.1 is consistently slower than on 6.1.1. I keep each shard no larger than 30 GB before rolling over.

The 6.1.1 cluster has about 100 shards per node, whereas 7.3.1 has around 61 shards per node. The main difference is that the default number of shards per index was reduced from 5 to 1 between the two versions.
My test was only conducted on indices with more than 8 shards (high-volume indices); therefore, the low-shard indices are not in the query path.

Given that you appear to have an index-heavy use case, I am very surprised to hear you are using index sorting. As per the blog post introducing it, it adds a lot of overhead at index time and is primarily designed for highly optimized search use cases where specific types of common queries can be sped up this way. Especially read the "When index sorting isn't a good fit" section towards the end, as it is possible that you are wasting a lot of resources at index time for potentially very little gain. This is a major difference between the clusters, and I would not be surprised if it accounts for a significant part of the difference in performance and resource usage.
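For context, index sorting is configured at index creation time with settings roughly like the ones below (the index name and sort fields here are just examples, and the mappings are omitted). Because every segment must be written in sorted order, each flush and merge pays an extra sorting cost:

```
PUT sorted-logs-example
{
  "settings": {
    "index.sort.field": ["org", "@timestamp"],
    "index.sort.order": ["asc", "desc"]
  }
}
```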

Shard size and count can also affect query performance, so if this varies a lot you are not really comparing like for like.

We search based on date, org, and user ID; therefore, I have sorting fields on those.
I am not too worried about the upfront CPU usage, as it tells me when I need to add more nodes. I am indeed configuring the cluster to optimize read performance.
What surprises me is that the query time is longer, which is not what I expected.
The shard count and sizes are very similar, so I'm pretty sure it is the difference in ES version.
It could be some default setting I am not configuring that changed between the two versions. We deliberately keep both clusters similar because I was eager to compare the performance improvements.

Is there any reference about performance on 7.3.x that I can read? Do you run query performance tests between 6 and 7? If performance degradation is not expected, then something is definitely wrong with my setup and I want to figure it out.
I was excited about the index sorting feature originally, but I'm extremely disappointed by my experience so far.

You can find various benchmarks at https://benchmarks.elastic.co/index.html; one comparing versions that might be somewhat close to your use-case is https://elasticsearch-benchmarks.elastic.co/index.html#tracks/nyc-taxis/release.

By the way, if you switch to index sorting and that uses more resources, I'd almost be surprised if it didn't affect the rest of the system as well (like search speed). Unfortunately, performance tuning is already tricky, and if you change multiple things at the same time it gets even harder. For example, I remember a case where someone upgraded Elasticsearch and also their Java version. Performance degraded, and after a lot of searching it turned out that the same JVM parameters behaved very differently on the new Java version. That's just an example of how small things in combination can have bad effects and can easily lead you down the wrong path while debugging.

Moving forward I'd look into:

  • Is index sorting really a win for your use case?
  • Sanity check of the common stats (queues and IO, memory, CPU usage).
  • Look into general best practices for read-only indices (reducing the number of shards, doing a force merge — mostly things built into ILM already).
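The last point can also be tried by hand on an index that is no longer being written to, for example (the index name is a placeholder):

```
POST /filebeat-2019.09.01/_forcemerge?max_num_segments=1
```

Force merging down to a single segment reduces per-segment overhead at query time, but it should only be run on read-only indices since new writes would create fresh segments again.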

Thanks for the suggestions. I will take a look at the links.
I forgot about the JVM version bump. But the bump is simply due to ES 7.3.1 bundling JDK 12; therefore, I assume it's not worse than before. Otherwise you wouldn't have bumped it.