Hi,
Our systems generate CSV files containing all the events that happened during the past hour.
There are about 15,000 CSV files generated every hour, each containing thousands of events.
A quick estimate puts that at roughly 70,000 events captured per second across these files.
Each line of each CSV contains one event: a unique ID, a CSV ID, a timestamp, a free-text field, and a few numeric fields.
The two main use cases are full-text search on the free-text field and aggregations on the numeric fields over time buckets.
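To make the second use case concrete, a typical query would combine a match on the free-text field with a per-hour date histogram over one of the numerics. Something along these lines (the index name, field names and search terms are placeholders, I haven't settled on a mapping yet):

```
GET events-*/_search
{
  "size": 0,
  "query": {
    "match": { "message": "error timeout" }
  },
  "aggs": {
    "per_hour": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "avg_metric_a": { "avg": { "field": "metric_a" } }
      }
    }
  }
}
```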
The data would ultimately reach about 5-6 TB at maximum retention.
The idea is to use Filebeat to ship the files to Logstash and index everything in Elasticsearch.
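For context, the Logstash pipeline I have in mind looks roughly like this (port, hosts, column names and the index name are placeholders):

```
input {
  beats {
    port => 5044
  }
}

filter {
  # one event per CSV line; the columns listed here are just examples
  csv {
    separator => ","
    columns => ["event_id", "csv_id", "event_time", "message", "metric_a", "metric_b"]
  }
  # use the event's own timestamp rather than the ingest time
  date {
    match => ["event_time", "ISO8601"]
  }
  # make sure the numerics are indexed as numbers, not strings
  mutate {
    convert => {
      "metric_a" => "float"
      "metric_b" => "float"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://es01:9200"]
    index => "events"
  }
}
```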
Being new to ES, I set up a single node and looked at the ingestion rate. I barely reach 4k events/s and it saturates all 4 cores of the test machine.
I understand that increasing the number of Logstash pipelines will help scale up my writes (I've sketched the settings I'd tune right after these questions). Correct?
Then increasing the number of ES nodes might also improve the situation, although it will multiply my hot storage needs. Correct?
Finally, adding more CPU cores should also help scale my writes, since CPU is currently the bottleneck on the test machine.
If those assumptions are correct, is there any advice you could give me on sizing this? Is there any other setup that would help scale my writes?
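On the Logstash side, these are the settings I was planning to experiment with, splitting the files across several pipelines (pipeline IDs, paths and values are placeholders):

```
# pipelines.yml
- pipeline.id: events-1
  path.config: "/etc/logstash/conf.d/events-1.conf"
  pipeline.workers: 4        # roughly one worker per available core
  pipeline.batch.size: 1000  # larger bulk requests to Elasticsearch
- pipeline.id: events-2
  path.config: "/etc/logstash/conf.d/events-2.conf"
  pipeline.workers: 4
  pipeline.batch.size: 1000
```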
I'm also thinking of using rollover to cap each index at about 200 GB, as I've read performance degrades once shards grow beyond 20-25 GB (so roughly 8-10 primary shards per index).
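Something like this ILM policy is what I had in mind, together with an index template and a write alias so new data always goes to the current index (the policy name and thresholds are placeholders):

```
PUT _ilm/policy/events-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "200gb",
            "max_age": "1d"
          }
        }
      }
    }
  }
}
```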
Looking for any advice, thanks for your help.