We are reading CloudTrail logs from S3 buckets, packed by account under a single bucket. Each file varies from 5 KB to 500 KB or more. We tested two scenarios with a 4 CPU / 8 GB Docker container; the larger file takes about 35 minutes to process, and we generate over 500 files per day for that account, so I am afraid a full day's data will take more than 24 hours to process. Since I cannot scale this out to multiple Logstash servers in a cluster reading from the same S3 bucket, there seems to be a limitation. I need advice quickly on how to scale this and process the data in under 2 minutes. Note: we also tested with a batch size of 2500 and it still takes 35 minutes. CPU utilization on the server stays under 30%, and the JVM had just 2 GC collections.
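For context, the pipeline is roughly the sketch below. The bucket name, prefix, sincedb path, and Elasticsearch endpoint are placeholders, and the json codec plus a split filter on the CloudTrail Records array is an assumption about how the events get broken out, not necessarily our exact config:

```
# Sketch of the S3 -> Elasticsearch pipeline (names and paths are placeholders)
input {
  s3 {
    bucket       => "cloudtrail-logs-account-xxxxxxx"            # placeholder bucket
    prefix       => "AWSLogs/xxxxxxx/CloudTrail/us-east-1/"      # one account / one region
    region       => "us-east-1"
    sincedb_path => "/var/lib/logstash/sincedb-cloudtrail"
    codec        => "json"    # files are gzipped JSON; the .gz is decompressed by the input
  }
}

filter {
  split {
    field => "Records"        # assumption: split each CloudTrail file into individual events
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # placeholder ES endpoint
    index => "cloudtrail-%{+YYYY.MM.dd}"
  }
}
```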
Scenario 1: 287 small .gz files
Tested with 287 small files from a single day (10/31/2017), a particular account (xxxxxxx), and a single region (us-east-1); each file is less than 5 KB and contains an average of 50 records.
Logstash pipeline configuration (sketched after this list):
batch size: 125
JVM heap: 1g-3g
Time taken to process and push the data to ES: 5 min
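Those settings map to roughly the following logstash.yml and jvm.options entries. Only the batch size and heap range come from our actual runs; the pipeline.workers and pipeline.batch.delay values are assumptions (what I believe are the defaults on a 4-CPU container):

```
# logstash.yml (sketch; only pipeline.batch.size matches our tested value)
pipeline.workers: 4        # assumed default: one worker per CPU core
pipeline.batch.size: 125   # batch size used in both scenarios
pipeline.batch.delay: 50   # assumed default batch delay (ms)

# jvm.options (heap range from our runs)
-Xms1g
-Xmx3g
```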
Scenario 2: a single large .gz file (500 KB) with 3200 records
Tested with a single file for one day (11/30/2017) from a particular account (XXXXXXX) and a single region (us-west-2); the file is 500 KB and contains around 3200 records.
Logstash pipeline configuration:
batch size: 125
JVM heap: 1g-3g
Time taken to process and push the data to ES: 35 min
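The 35 minutes did not change when we reran this scenario with the batch size of 2500 mentioned at the top. That run only overrode the batch size, roughly like this (the config file name is a placeholder, and overriding on the command line rather than in logstash.yml is an assumption):

```
# Same pipeline, batch size raised to 2500 for the retest
bin/logstash -b 2500 -f cloudtrail-pipeline.conf
```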