Is the 45,000 events/sec a peak or an average rate? How long are you going to keep the data once it is indexed? What type of data is it? What do the query patterns and latency requirements look like? What type of hardware are you looking to deploy on?
Just FYI, we have just turned on winlogbeat on many domain controllers with `ignore_older: 72h`, so we had an instant backlog of data. We started ingesting at about 10K events/sec into 6 Dell R640s running Logstash, which are also Elasticsearch data nodes on spinning disk (budget won over performance). Logstash was configured with 3 pipeline workers; changing that to 8 pushed ingest to over 20K/sec, and increasing it to 12 got us to almost 50K/sec. Logstash will use a CPU core per worker thread when busy. This was on top of all other normal ingest of approx 3K/sec. We sustained this rate for a few hours until the 72 hours of old data had been ingested.
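For reference, the relevant settings were along these lines. This is a minimal sketch, not our exact config: the event log names and the batch size shown are assumptions.

```yaml
# winlogbeat.yml (sketch) -- on first run, skip events older than 72h;
# anything newer than that becomes an instant backlog to ingest
winlogbeat.event_logs:
  - name: Security          # log names here are assumptions
    ignore_older: 72h
  - name: System
    ignore_older: 72h

# logstash.yml (sketch) -- worker threads per pipeline; each uses roughly
# one CPU core when busy, so raising this from 3 to 12 is what lifted
# our throughput from ~10K to ~50K events/sec
pipeline.workers: 12
pipeline.batch.size: 125    # the default; shown only for context
```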
Our design target is an eventual 10K events/sec, so I think we'll be able to handle that.
Maybe, but you might be able to achieve your target performance with fewer nodes if you were using SSDs. The nightly benchmarks run on a 3-node cluster (with SSDs) and exceed the numbers you're seeing here by quite some margin.
The size of the cluster also depends heavily on how long you are keeping the data and what your query requirements are, which you have not yet detailed.