Logstash doesn't seem to be the bottleneck. By sending the LS output to /dev/null, I was able to achieve an average rate of 97,000 documents per second.
When I replace the output with Elasticsearch, it drops to an average of 20,000/s for a short period.
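For anyone wanting to reproduce the comparison, it roughly amounts to swapping the Logstash output between a throwaway sink and the real Elasticsearch output; the host and index names below are placeholders, not my actual config:

```
# Baseline: discard all events by writing them to /dev/null
output {
  file {
    path => "/dev/null"
  }
}

# Real run: same pipeline, but with the Elasticsearch output instead
# output {
#   elasticsearch {
#     hosts => ["http://es-data-node:9200"]
#     index => "logs-test"
#   }
# }
```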
I have tried scaling up to find the limit, first by increasing the number of shards, then the number of indices.
Right now, on a single dedicated data node with 8 CPUs, 32 GB RAM and a 16 GB heap, I can't get past 25k/s regardless of how many or how few shards and indices I have.
Examples:
Ingesting 3 million documents (logs) into a single index with 2 shards and no replicas averages 20k/s.
Ingesting 3 million documents into 2 indices with 2 shards each, still no replicas, averages 23k/s.
With a larger sample it breaks down:
Ingesting 30 million documents into 2 indices with 2 shards each, still no replicas, averages 13k/s.
Ingesting 30 million documents into 6 indices with 20 shards each, still no replicas, averages 14.4k/s.
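In each test the indices were created up front with explicit settings, along these lines (the index name is just a placeholder):

```
PUT /logs-test-1
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}
```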
I was able to improve Logstash throughput by tweaking pipeline.workers and pipeline.batch.size, but for Elasticsearch I have found nothing to tune other than indices/shards/replicas.
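For reference, those Logstash settings live in logstash.yml; the values below are only illustrative of what I tweaked, not a recommendation:

```
# logstash.yml -- illustrative values, tune for your own hardware
pipeline.workers: 8        # defaults to the number of CPU cores on the host
pipeline.batch.size: 1000  # events each worker collects before flushing (default 125)
```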
One of the most important factors for ingest throughput is the performance of the storage, which you have not mentioned at all. This is discussed in the documentation around tuning for indexing speed and having local, fast SSDs is recommended. What type of storage are you using?
It seems I was a bit quick to post, as that is exactly what I have just changed. The storage is on AWS: we had a standard gp3 volume with 125 MB/s throughput and 3,000 IOPS. I have just doubled the volume's throughput and IOPS and I am already seeing an improvement. I will run some more tests and leave my results here for anyone else who might need them.
Thanks for the link to the doc, I will read it now.
That is still way off the speed you get with a local SSD, so probably still qualifies as slow storage.
With this type of storage I would recommend indexing into as few shards as possible, as that results in fewer, larger writes. Also be aware that if you are specifying your own document IDs, each indexing operation needs to be treated as a potential update, which reduces throughput and makes indexing slow down further as indices grow in size.
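As a rough sketch of what that means on the Logstash side (host and index names here are placeholders), simply leave document_id unset so Elasticsearch generates the IDs itself:

```
output {
  elasticsearch {
    hosts => ["http://es-data-node:9200"]
    index => "logs-test"
    # no document_id set, so Elasticsearch auto-generates IDs and every
    # indexing operation is a pure append rather than a potential update
  }
}
```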
AWS EBS gp3 default storage is indeed very slow. It is OK for most apps in our cluster but clearly not enough for Elasticsearch.
Just by doubling the throughput and IOPS I was able to reach 28k/s, which is more than double the previous run, and that with only 2 indices of 2 shards each.
@Christian_Dahlqvist I will take your advice and limit the number of shards and see how that goes. In prod we generate ~350 million log lines a day, and we need to plan for 10x that in the future, so it looks like most of our cost will come from sizing the storage to Elasticsearch's needs.
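Doing the rough math, 350 million documents a day is only about 4,050/s on average (350,000,000 / 86,400 s), so 10x that is roughly 40,500/s sustained, before accounting for peaks or any backfill.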
If you are ingesting log data you may want to look at data streams and a hot-warm architecture. In such an architecture you have a number of nodes with very fast storage, e.g. i4i instances or similar, which handle all indexing. The warm tier then holds indices once they are no longer written to and can run on slower hardware while still serving queries efficiently.
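As a minimal sketch of the moving parts (names, sizes and ages below are placeholders): an ILM policy whose hot phase rolls indices over while they are being written, with a warm phase that later lets ILM migrate them to data_warm nodes automatically, plus an index template that turns the matching indices into a data stream:

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {}
      }
    }
  }
}

PUT _index_template/logs-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}
```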
I was under the impression that data streams were meant more for data such as metrics, and not for logs where you have to track values. Back to reading up on data streams, I guess.
As for warm tiers, I would guess it comes down to mounting storage with different specs depending on the tier. I'll read up on that too.
Looking at the i4i family, I see they have local SSDs, which I am guessing are far more performant than my gp3 storage and better suited for data nodes.
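From what I can tell so far, the tier assignment itself is just a node role in elasticsearch.yml, something along these lines (one block per node type):

```
# elasticsearch.yml on an indexing node with fast local SSD (e.g. i4i)
node.roles: [ data_hot, data_content, ingest ]

# elasticsearch.yml on a node holding older, read-mostly indices on cheaper storage
node.roles: [ data_warm ]
```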