We were running Elasticsearch 6.7.2 and decided to upgrade to 7.x, but our performance has degraded since the upgrade. We use the cluster to ingest log data, all of it shipped via Filebeat. Index names contain dates, so a new index is created each day.
Since moving to 7.3.0, every day at the moment the new indexes are created, the cluster goes yellow and sometimes red. We tried increasing our index buffer size, but that didn't seem to help. Yesterday the cluster went red (indexing stopped) and took about two hours to recover back to green and resume indexing.
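For anyone wanting to see what we observe, these are standard cluster APIs that can be run during the spike; they should show whether index-creation tasks are queueing on the master and whether write threads are rejecting requests (this is just how we'd diagnose it, not something we've fully narrowed down yet):

```
# Pending cluster-state tasks: daily index creation shows up here
# when the master is backed up
GET _cluster/pending_tasks

# Per-index health, to see which of the new daily indexes are yellow/red
GET _cluster/health?level=indices

# Write thread pool queue depth and rejections on each node during the spike
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected
```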
Occasionally we also notice data nodes dropping out of the cluster and restarting, which triggers shard rebalancing.
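When a node drops, something like the following helps us tell a long GC pause apart from an actual process restart (uptime resets indicate a restart; high heap usage just before a drop points at GC pressure):

```
# Uptime resets reveal restarted nodes; heap.percent shows GC pressure
GET _cat/nodes?v&h=name,node.role,uptime,heap.percent,ram.percent,cpu,load_1m
```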
Our ingest rate is about 25,000 to 35,000 documents per second on a cluster of 3 master, 6 client, and 90 data nodes, each with an SSD-backed data directory, 60 GB of memory, an 8-core CPU, and the heap set to 30.5 GB.
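Spread evenly across the data nodes, that works out to roughly 25,000 / 90 ≈ 280 up to 35,000 / 90 ≈ 390 documents per second per node at peak, so the raw per-node indexing load doesn't seem extreme.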
Here is what our current config looks like. Any suggestions to help improve index creation performance during heavy ingest rates are greatly appreciated.
```
cluster.name: some-cluster-name
path.data: "/some/storage/location/on/ssd"
path.logs: "/var/log/elasticsearch"
node.max_local_storage_nodes: 1
bootstrap.memory_lock: true
gateway.recover_after_nodes: 2
gateway.recover_after_time: 10m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.hosts_provider: ec2
discovery.ec2.host_type: private_ip
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.availability_zones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cluster.routing.allocation.cluster_concurrent_rebalance: 50
cluster.routing.allocation.node_concurrent_incoming_recoveries: 5
cluster.routing.allocation.node_concurrent_outgoing_recoveries: 5
indices.recovery.max_bytes_per_sec: 250mb
indices.memory.index_buffer_size: 40%
network.host: 0.0.0.0
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: /some/path/to/ssl/key
xpack.security.transport.ssl.certificate: /some/other/path/to/cert
xpack.security.transport.ssl.certificate_authorities: /another/path/to/the/ca
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.http.ssl.enabled: false
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
xpack.security.authc.anonymous.roles: healthcheck
xpack.security.authc.anonymous.authz_exception: true
```
Cluster name and paths have been replaced with dummy data.
I was thinking of setting index.translog.durability to async to reduce the fsync on every request, but I've read that the performance gain isn't much and that changing it is usually not recommended.
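For reference, the change I'm weighing would look something like this, applied per index or via an index template (the index name is a placeholder, and the 30s sync interval is just an example; the default is 5s):

```
PUT logs-2019.09.05/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s"
}
```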