Hi, I have a general architectural question.
I'm going to use Elasticsearch with the goal of maximizing data ingestion speed. I don't care about durability; ingestion throughput is the most important thing.
I have 100k new lines per second in my log file, and the end-to-end delay (log file -> Logstash -> indexed in Elasticsearch) must be no more than 2-3 seconds.
Would it be worthwhile to create many data nodes with the replica count set to 0? Does adding data nodes while keeping replicas at 0 increase ingestion throughput?
Is it time series data? Why does the delay have to be no more than 2-3 seconds?
Decent NVMe storage should give you 500,000 writes per second easily, but testing is the fun part.
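If you want a quick sanity check before reaching for a proper tool like fio, here is a crude Python sketch that measures sequential write throughput (not IOPS, so it only roughly approximates the claim above). The path and sizes are assumptions; point `PATH` at the NVMe volume you actually plan to use.

```python
import os
import time

PATH = "/tmp/ingest_bench.dat"   # assumed path: change to your NVMe mount
BLOCK = b"x" * (1 << 20)         # 1 MiB blocks
BLOCKS = 1024                    # write 1 GiB total

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
start = time.perf_counter()
for _ in range(BLOCKS):
    os.write(fd, BLOCK)
os.fsync(fd)                     # force data to disk before stopping the clock
elapsed = time.perf_counter() - start
os.close(fd)
os.unlink(PATH)

print(f"{BLOCKS / elapsed:,.0f} MiB/s sequential write")
```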
Is it 100k lines per second in a single log file? If so, your bottleneck may very well be tailing, parsing, and reformatting the data before sending it to Elasticsearch, as this part is unlikely to scale linearly.
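A quick way to check is to benchmark the parse step on its own. The sketch below uses a made-up regex and log line, so substitute whatever your Logstash grok/dissect filter actually does; if a single thread cannot comfortably exceed 100k lines/s here, the pipeline in front of Elasticsearch is the first thing to fix.

```python
import re
import time

# Hypothetical log pattern; replace with your real parsing logic.
LINE_RE = re.compile(r'^(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)$')

def bench(n=1_000_000):
    line = "2024-01-01T00:00:00Z INFO user logged in from 10.0.0.1"
    start = time.perf_counter()
    for _ in range(n):
        doc = LINE_RE.match(line).groupdict()  # parse one line into a dict
    elapsed = time.perf_counter() - start
    print(f"{n / elapsed:,.0f} lines/s single-threaded")

bench()
```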
With respect to scaling Elasticsearch, having multiple indexing nodes with a suitable primary shard count and no replicas will give the best throughput, at the cost of resiliency and availability. To get the best out of it, you will also need to feed it with parallel bulk requests, which depends on the design of your data processing pipeline.
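As a rough illustration rather than a drop-in config, here is what that could look like with a recent elasticsearch-py client: an index with replicas disabled and several primaries to spread indexing across data nodes, fed through `helpers.parallel_bulk`. The index name, shard count, thread count, and chunk size are all assumptions you would need to tune against your own cluster.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Ingest-optimized index: no replicas (max throughput, no resiliency)
# and a refresh interval that still fits the 2-3 s visibility budget.
es.indices.create(
    index="logs-fast",
    settings={
        "number_of_shards": 6,       # roughly one primary per data node
        "number_of_replicas": 0,
        "refresh_interval": "1s",
    },
)

def actions(lines):
    # Turn raw log lines into bulk index actions.
    for line in lines:
        yield {"_index": "logs-fast", "_source": {"message": line}}

sample = ["2024-01-01T00:00:00Z INFO something happened"] * 50_000

# parallel_bulk issues concurrent bulk requests from a thread pool,
# which is what actually keeps multiple indexing nodes busy.
for ok, info in parallel_bulk(es, actions(sample), thread_count=8, chunk_size=5_000):
    if not ok:
        print("failed:", info)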
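Note that `refresh_interval` matters for your latency requirement: documents only become searchable after a refresh, so raising it buys indexing throughput but eats directly into the 2-3 second budget.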