Tuning to handle extreme initial ingestion conditions (with logstash)

Hey folks.

Well, I'll be frank: Logstash has really rubbed me the wrong way recently. It's not Logstash's fault, for the most part, but it's been really difficult to get things working in my particular use case.

Initially, we began ingesting AWS ELB logs from Amazon S3, and things are okay. It's about 500 files per day, which is manageable with our Logstash/Elastic cluster hosted at elastic.co, a 2x8GB cluster.

Now, I'd like to begin ingesting CloudFront logs, which are a whole different beast. The way they are generated for us is literally thousands upon thousands upon thousands of tiny .gz files per day. Going back just 10 days or so, I currently have 1.3 million tiny files, which is only maybe 100M records total. (We have about 1k CloudFront distributions x 5-minute delivery x many edge nodes... it's a mess.)

Anyway, I re-set up some things for the back-logged ingestion (I'm ultimately trying to go back ~1 month and retain ~1 month going forward). I wrote a simple PHP script to iterate over each of the 1.3 million files on disk (which covers less than 10 days), split them up across ~512 individual, gunzipped, simple log files, and set up a special grok match so I can retain the original filename (basically just prepending it to each line).
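For reference, a minimal sketch of what that grok match could look like, assuming each line was written as the original S3 key, a space, and then the original log line (the field name is hypothetical, not my exact config):

  filter {
    grok {
      # Peel the prepended filename off into its own field, keep the rest as the message
      match => { "message" => "^%{NOTSPACE:source_file} %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }
  }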

Since I am also going to re-ingest a month's worth of ELB logs with updates to the filtering, I went ahead and did the same, and what I'm left with is about 12GB of uncompressed log files split across exactly 1024 files. I even have them stored on a tmpfs mount. Logstash was choking on attempting to ingest just 100k or so files; I couldn't imagine 1.3 million.
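The input side is just the stock file input pointed at those pre-split files, roughly like this (path and options are illustrative, not the exact config):

  input {
    file {
      path => "/mnt/tmpfs/backfill/*.log"   # the 1024 pre-split, uncompressed files on the tmpfs mount
      start_position => "beginning"         # read existing content, not just newly appended lines
      sincedb_path => "/dev/null"           # one-shot backfill, no need to persist read positions
    }
  }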

Also, for this routine, I have increased the size of our elastic.co cluster to literally as large as it will allow -- 2x58GB -- and fired up a massive 40-something-CPU instance on AWS. This isn't cheap, and I mean to return to manageable levels once the initial ingestion is over.

And... I finally got everything situated, fired up 128 workers with the default batch size and delay settings, and...

[INFO ] 2019-06-28 18:49:07.726 [[main]>worker52] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [3210207][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[access-cf-dealer-www-2019.06.20][0]] containing [30] requests, target allocation id: 90dcD37zQNaFKKvzV1ip3Q, primary term: 1 on EsThreadPoolExecutor[name = instance-0000000010/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2b53c826[Running, pool size = 14, active threads = 14, queued tasks = 208, completed tasks = 910166]]"})
[INFO ] 2019-06-28 18:49:07.727 [[main]>worker52] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [3210207][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[access-cf-dealer-www-2019.06.20][0]] containing [30] requests, target allocation id: 90dcD37zQNaFKKvzV1ip3Q, primary term: 1 on EsThreadPoolExecutor[name = instance-0000000010/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2b53c826[Running, pool size = 14, active threads = 14, queued tasks = 208, completed tasks = 910166]]"})

The cluster seems to be able to handle around 15k ingestions/sec, but it's choking. I did some reading, and it was suggested to just ask here.

Any suggestions?

  • Trying to get maximum ingestions/sec.
  • Using elastic.co 2x58GB aws.data.highio.i3.
  • Tried with 128, 64, and 32 workers. Lowering the worker count helps, but it also significantly decreases ingestions/sec.
  • No other special settings.
  • Except Logstash's JVM heap size min/max, which has been set to 30 wallet-burning gigabytes (relevant settings sketched below).
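For reference, roughly how those knobs are expressed; this is just a sketch mirroring the values above, with batch size and delay at their defaults:

  # logstash.yml (sketch)
  pipeline.workers: 128        # also tried 64 and 32
  pipeline.batch.size: 125     # default
  pipeline.batch.delay: 50     # default (ms)

  # jvm.options (sketch)
  # -Xms30g
  # -Xmx30g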

If I recall correctly, Logstash's file input plugin is single-threaded and can have performance problems when reading a very large number of files. Filebeat was created specifically to handle this kind of workload and copes with it better. I would recommend trying Filebeat, which should also allow you to run considerably fewer worker threads and a smaller heap on your Logstash instance.
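Something along these lines, as a minimal sketch (the path is hypothetical, and you would point the output at your existing Logstash pipeline's beats input, or directly at Elasticsearch):

  # filebeat.yml (sketch)
  filebeat.inputs:
    - type: log
      paths:
        - /mnt/tmpfs/backfill/*.log   # hypothetical location of the pre-split files

  output.logstash:
    hosts: ["localhost:5044"]         # hand events to Logstash so the existing grok/filtering still applies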

[30] seems like a pretty small number of documents to be indexing at once. That's the number of documents going into this one shard, so the bulk that went to Elasticsearch might be larger.

[access-cf-dealer-www-2019.06.20] looks like you're using daily indices, but you've only got 12GB of data for the month, so I think daily indices are inappropriately small. Fewer shards will mean larger per-shard bulks, and therefore fewer tasks to exhaust the queues.
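For example, switching the Logstash elasticsearch output from a daily to a monthly index pattern would be one way to get there (the index name is just illustrative, taken from your log message; other output options omitted):

  output {
    elasticsearch {
      # One index per month instead of per day: fewer, larger shards and larger per-shard bulks
      index => "access-cf-dealer-www-%{+YYYY.MM}"
    }
  }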

pool size = 14 says you've got 14 indexing threads to play with on this node, so I'm not surprised that 128 workers is a bit overwhelming.
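If it helps, you can watch the write thread pool and its rejections while you tune, e.g. with:

  GET _cat/thread_pool/write?v&h=node_name,name,active,size,queue,rejected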


Good point, David. Have a look at this resource as well:

https://www.elastic.co/guide/en/elasticsearch/reference/7.2/tune-for-indexing-speed.html

If you are indexing data covering a long time period, make sure you do not create a lot of small indices by splitting the data across overly granular indices and/or using daily indices.
