Hey folks.
Well, I'll be frank: Logstash has really rubbed me the wrong way recently. It's not Logstash's fault, for the most part, but it's been really difficult to get things working in my particular use case.
Initially, we began ingesting AWS ELB logs from Amazon S3, and things are okay. It's about 500 files per day, and it's manageable with our Logstash/Elastic cluster hosted at elastic.co, which is two 8GB nodes (2x8GB).
Now, I'd like to begin ingesting CloudFront logs, which are a whole different beast. The way they are generated, for us, is literally thousands upon thousands upon thousands of tiny .gz files per day. Going back just 10 days or so, I currently have 1.3 million tiny files, which is only maybe 100M records total. (We have about 1k CF distributions x 5-minute delivery x many edge nodes... it's a mess.)
Anyway, I re-set some things up for the back-logged ingestion (ultimately I'm trying to go back ~1 month and retain ~1 month going forward). I wrote a simple PHP script that iterates over each of the 1.3 million files on disk (which cover less than 10 days) and splits their contents across ~512 individual, gunzipped plain log files, prepending the original filename to each line, and I set up a special grok match so I can retain that filename.
Since I'm also going to re-ingest a month's worth of ELB logs with updates to the filtering, I went ahead and did the same there, and what I'm left with is about 12GB of uncompressed log files split across exactly 1024 files. I even have them stored on a tmpfs mount. Logstash was choking on attempting to ingest just 100k or so files directly; I couldn't imagine 1.3M.
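For context, the input/filter side of the pipeline is roughly along these lines. The path, field names, and the exact pattern are simplified placeholders rather than my literal config, but this is the idea behind the prepended-filename grok:

```
input {
  file {
    # the ~1024 pre-split, gunzipped files sitting on the tmpfs mount
    path => "/mnt/ingest-tmpfs/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  grok {
    # each line starts with the original S3 object name that the PHP
    # splitter prepended, followed by the raw ELB/CloudFront log line
    match => { "message" => "^%{NOTSPACE:source_file} %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }
  # ...normal ELB/CloudFront parsing continues from here...
}
```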
Also, for this routine, I have increased our elastic.co cluster to the largest size it will allow (2x58GB) and fired up a massive 40-something-vCPU instance on AWS for Logstash. This isn't cheap, and I intend to scale back down to manageable levels once the initial ingestion is over.
And... I finally got everything situated, fired up 128 workers with the default batch size and delay settings, and...
```
[INFO ] 2019-06-28 18:49:07.726 [[main]>worker52] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [3210207][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[access-cf-dealer-www-2019.06.20][0]] containing [30] requests, target allocation id: 90dcD37zQNaFKKvzV1ip3Q, primary term: 1 on EsThreadPoolExecutor[name = instance-0000000010/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2b53c826[Running, pool size = 14, active threads = 14, queued tasks = 208, completed tasks = 910166]]"})
[INFO ] 2019-06-28 18:49:07.727 [[main]>worker52] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [3210207][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[access-cf-dealer-www-2019.06.20][0]] containing [30] requests, target allocation id: 90dcD37zQNaFKKvzV1ip3Q, primary term: 1 on EsThreadPoolExecutor[name = instance-0000000010/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2b53c826[Running, pool size = 14, active threads = 14, queued tasks = 208, completed tasks = 910166]]"})
```
The cluster seems to be able to handle around 15k ingestions/sec, but it's choking. I did some reading, and it was suggested to just ask here.
Any suggestions?
- Trying to get max ingestions/sec
- Using elastic.co 2x58GB aws.data.highio.i3
- Tried with 128, 64, and 32 workers. Lowering the worker count helps, but it also significantly decreases the ingestions/sec (more on this below).
- No other special settings.
- Except Logstash's JVM heap size min/max has been set to 30 wallet-burning gigabytes.
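For what it's worth, the 429s themselves spell out the bottleneck: each data node's write thread pool has 14 threads plus a 200-slot queue, and every Logstash bulk request gets split into per-index, per-shard BulkShardRequests that have to fit into that queue. With 128 workers each firing bulks (default batch size is 125 events) against daily per-site indices like access-cf-dealer-www-2019.06.20, the fan-out overwhelms those ~214 slots almost immediately, which presumably is why lowering the worker count stops the rejections but also lowers throughput. For completeness, the output side of the pipeline is roughly this — the endpoint, credentials, and the `site` field below are placeholders, not my exact config:

```
output {
  elasticsearch {
    # Elastic Cloud endpoint and credentials are placeholders
    hosts    => ["https://<cluster-id>.us-east-1.aws.found.io:9243"]
    user     => "logstash_writer"
    password => "${ES_PWD}"
    # one index per site per day, e.g. access-cf-dealer-www-2019.06.20
    index => "access-cf-%{[site]}-%{+YYYY.MM.dd}"
    # compress bulk request bodies on the way to Elastic Cloud
    http_compression => true
  }
}
```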