At our company we recently upgraded our ELK environment from 1.7.2 to 6.2.4, moving to new servers at the same time.
We also replaced logstash-forwarder on the hosts with Filebeat for shipping the logs.
In addition, we changed the cluster setup:
Previous setup:
1 server running Kibana, Elasticsearch & Logstash
New setup:
2 Elasticsearch servers (nodes)
1 server running Kibana & Logstash
The volume of logs is practically the same (slightly more), but there is a lot more parsing: we split our logs into more fields and map some fields to integers. Apart from that, only the version and the setup have changed.
However, we notice that the old ELK stack (one server) averages 175 write IOPS, while the new ELK stack shows the following write IOPS:
Master Elasticsearch node: 542 write IOPS
Data Elasticsearch node: 523 write IOPS
Is there a known reason for this, or is it a configuration mistake?
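To put the figures above in proportion, here is a small sketch that computes how many times higher the write IOPS are per node and for the two nodes combined. The variable and function names are only illustrative; the numbers are the ones quoted in this post.

```python
# Write IOPS figures quoted in the post above.
OLD_SINGLE_SERVER = 175
NEW_NODES = {"master": 542, "data": 523}


def iops_ratio(new_iops: float, old_iops: float) -> float:
    """Return how many times higher the new write IOPS are than the old."""
    return new_iops / old_iops


# Ratio for each new node against the old single server.
per_node = {
    name: round(iops_ratio(value, OLD_SINGLE_SERVER), 2)
    for name, value in NEW_NODES.items()
}

# Ratio for both new nodes together against the old single server.
combined = round(iops_ratio(sum(NEW_NODES.values()), OLD_SINGLE_SERVER), 2)

print(per_node)   # {'master': 3.1, 'data': 2.99}
print(combined)   # 6.09
```

So each node writes roughly three times as often as the old server did, and the cluster as a whole roughly six times.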
More fields of more types can definitely contribute to this.
The storage configuration can also play a role. What is the storage setup on the old vs. the new system? In particular, does the new system use software RAID, especially RAID 5 or 6?
Yes, I knew that more fields could be a reason for this, but not to this extent (the new cluster doesn't process three times as many logs, so it seems odd that the write IOPS are so much higher).
The storage configuration is exactly the same, no changes whatsoever.
The event rate on the new ELK stack varies from 500 to 2500 events per second (with peaks up to 4000). For the old one I can't know exactly, since we don't have any monitoring enabled on it, but when I look at the indices, the document counts are the same (with a deviation of 5% max).
In Elasticsearch 6.x more data types use doc values than in Elasticsearch 1.x. This reduces heap usage but also results in additional disk I/O and larger size on disk. You may also want to compare mappings between the two versions to see if there are any other changes.
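As a rough way to do that mapping comparison, the sketch below flattens the "properties" section of a mapping (as returned by GET /NAME-OF-INDEX/_mapping) into field-to-type pairs and reports what was added, removed or changed. The helper names and the sample mappings are made up for illustration; you would paste in the "properties" objects from your own 1.x and 6.x indices.

```python
def flatten_field_types(properties, prefix=""):
    """Return {dotted.field.name: type} for a mapping 'properties' dict."""
    fields = {}
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object field: recurse into its sub-fields.
            fields.update(flatten_field_types(spec["properties"], path + "."))
        else:
            fields[path] = spec.get("type", "object")
        # Multi-fields (e.g. a "keyword" sub-field) are indexed separately.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            fields[path + "." + sub_name] = sub_spec.get("type", "object")
    return fields


def diff_field_types(old_props, new_props):
    """Compare two 'properties' dicts and summarise the differences."""
    old = flatten_field_types(old_props)
    new = flatten_field_types(new_props)
    common = sorted(set(old) & set(new))
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": {f: (old[f], new[f]) for f in common if old[f] != new[f]},
    }


# Hypothetical sample mappings, not taken from the poster's cluster.
old_props = {"message": {"type": "string"}, "status": {"type": "string"}}
new_props = {
    "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "status": {"type": "integer"},
    "bytes": {"type": "integer"},
}

result = diff_field_types(old_props, new_props)
print(result["added"])    # ['bytes', 'message.keyword']
print(result["changed"])  # {'message': ('string', 'text'), 'status': ('string', 'integer')}
```

Every added field, and every extra multi-field such as a keyword sub-field, is additional data that has to be written at index time.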
You might also want to run GET /NAME-OF-INDEX/_settings on a similar index on both systems and compare the value of refresh_interval. A new segment (which is basically a file) is created at each refresh, so a very short refresh interval could also be contributing to more write IOPS.
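A minimal sketch of that comparison, assuming you have saved the JSON returned by GET /NAME-OF-INDEX/_settings from each cluster. The function name and the sample responses are hypothetical. Since the exact response layout may differ between versions, the helper accepts both a nested "index" object and a flat "index.refresh_interval" key, and falls back to the 1s default when the setting is not set explicitly.

```python
# Elasticsearch refreshes every second unless the index says otherwise.
DEFAULT_REFRESH_INTERVAL = "1s"


def refresh_interval(settings_response, index_name):
    """Extract refresh_interval for one index from a _settings response."""
    index_settings = settings_response[index_name]["settings"]
    nested = index_settings.get("index", {})
    if isinstance(nested, dict) and "refresh_interval" in nested:
        return nested["refresh_interval"]
    # Flat key form, or not set at all (default applies).
    return index_settings.get("index.refresh_interval", DEFAULT_REFRESH_INTERVAL)


# Hypothetical responses; replace with the real output from both clusters.
old_response = {
    "logstash-2018.05.01": {"settings": {"index": {"refresh_interval": "5s"}}}
}
new_response = {
    "logstash-2018.05.01": {"settings": {"index": {"number_of_shards": "5"}}}
}

old_value = refresh_interval(old_response, "logstash-2018.05.01")
new_value = refresh_interval(new_response, "logstash-2018.05.01")
print(old_value, new_value)  # 5s 1s
```

If the old index turns out to have an explicit, longer interval (older Logstash index templates often set one) and the new index runs on the default, the new cluster is producing segments more often for the same data.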
What do you mean by larger size on disk? With the new ELK cluster we use a lot less disk space for more data than before.
And what differences in mappings could cause higher write IOPS?
Good point. There is more data going into doc values, which increases the size of those data types, but at the same time you no longer index an _all field, which saves a lot of space. If you use the best_compression codec you could save even more space.
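For reference, the codec is controlled by the index.codec setting, which has to be in place when an index is created, so it is usually set through an index template. The sketch below only builds the request body for a 6.x template; the "logstash-*" pattern and the function name are assumptions, and the template affects only indices created after it is installed.

```python
import json


def compression_template(index_pattern):
    """Build a 6.x index template body that enables best_compression."""
    return {
        "index_patterns": [index_pattern],
        "settings": {"index": {"codec": "best_compression"}},
    }


# "logstash-*" is an assumed pattern; use whatever your indices are called.
body = compression_template("logstash-*")
print(json.dumps(body, indent=2))
```

The printed JSON is what you would send with PUT _template/SOME-TEMPLATE-NAME.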
One thing that has changed between 1.x and 6.x is that the transaction log is synced to disk much more frequently in order to improve resiliency, but I am not sure what impact this could have on the reported IOPS.
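If you want to measure how much of the difference comes from the transaction log, one option is to temporarily change index.translog.durability on a test index and compare the IOPS. The sketch below only builds the body for PUT /NAME-OF-INDEX/_settings; the function name is made up. Keep in mind that with "async" a crash can lose writes that were acknowledged but not yet synced, so treat this as an experiment rather than a recommendation.

```python
import json

# "request" syncs the translog after every request (the default in recent versions);
# "async" syncs it in the background at a fixed interval instead.
VALID_DURABILITY = ("request", "async")


def translog_settings(durability="request"):
    """Build a settings body that selects the translog durability mode."""
    if durability not in VALID_DURABILITY:
        raise ValueError("durability must be 'request' or 'async'")
    return {"index": {"translog": {"durability": durability}}}


test_body = translog_settings("async")
print(json.dumps(test_body))
```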