Logstash is updating sincedb but indices are not being created

I am collecting ALB logs from S3 into Elasticsearch using Logstash. For now I want to collect more than one year of data from S3 and ingest it into Elasticsearch. I am creating daily indices named index-name-*. When I start Logstash it reads the logs and ingests them into Elasticsearch, but at some point it stops creating indices, even though the sincedb keeps updating. I am using an m7g.2xlarge instance and have allocated 6GB of heap for Logstash as well. Please advise on this issue.

Welcome to the community.

Please provide more details:

  • Which LS version are you using?
  • Have you checked the LS logs?
  • Have you opened the sincedb file to see what LS has read?
  • Which input plugin are you using, input-s3 or file?
  • Can you provide the specific settings for the input plugin? Remove sensitive details like access keys, server names...

How many files do you have in the bucket? I would not recommend using Logstash to read S3 buckets with a huge number of files; the performance is pretty bad, to the point that it becomes unusable.

input {
  s3 {
    bucket => "bucket-name"
    region => "ap-south-1"
    type => "alb-logs"
    additional_settings => {
      force_path_style => true
      follow_redirects => false
    }
    sincedb_path => "/volume/sincedb/public-alp"
  }
}

There are no error logs in the LS logs. I am using the latest LS, 8.12. I checked the sincedb file and it's updating properly, but indices are not being created accordingly.

TBH it's reading a lot of files; it's reading ALB logs from 1st June 2023 till date. Some index sizes are in GBs too. But I have allocated 6GB for the heap; can I get more performance if I increase the size further?

As I said in the previous reply, it's reading logs from 1st June 2023 till date. But it has successfully read logs only up to 29th March 2024; it's failing to read logs after 29th March 2024, which is just the last 4 days.

Have you changed the conf file lately, especially at the end of March?
Have you checked ES disk space?
Have you checked the data? Maybe you have IF conditions which for some reason are no longer met.

This is the first time we are spinning up this environment. More than 1TB of disk space is available. There are no such conditions added.

Below is the filter that I've added:

filter {
  if [type] == "alb-logs" or [type] == "alb-logs-private" {
    grok {
      match => ["message", "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:loadbalancer} %{IP:client_ip}:%{NUMBER:client_port:int} (?:%{IP:backend_ip}:%{NUMBER:backend_port:int}|-) %{NUMBER:request_processing_time:float} %{NUMBER:backend_processing_time:float} %{NUMBER:response_processing_time:float} (?:%{NUMBER:elb_status_code:int}|-) (?:%{NUMBER:backend_status_code:int}|-) %{NUMBER:received_bytes:int} %{NUMBER:sent_bytes:int} \"(?:%{WORD:verb}|-) (?:%{GREEDYDATA:request}|-) (?:HTTP/%{NUMBER:httpversion}|-( )?)\" \"%{DATA:userAgent}\"( %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol})?"]
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }
}
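To illustrate what the grok above extracts, here is a rough Python equivalent of the core of that pattern. This is not the grok engine itself: the regex is simplified (it stops at the HTTP version and skips the user-agent/SSL tail), and the sample log line is made up for illustration.

```python
import re

# Simplified Python approximation of the grok pattern above;
# group names mirror the grok capture names.
ELB_LINE = (
    r'(?P<timestamp>\S+) (?P<loadbalancer>\S+) '
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+) '
    r'(?:(?P<backend_ip>[\d.]+):(?P<backend_port>\d+)|-) '
    r'(?P<request_processing_time>[\d.-]+) '
    r'(?P<backend_processing_time>[\d.-]+) '
    r'(?P<response_processing_time>[\d.-]+) '
    r'(?P<elb_status_code>\d+|-) (?P<backend_status_code>\d+|-) '
    r'(?P<received_bytes>\d+) (?P<sent_bytes>\d+) '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>\S+)"'
)

# Made-up sample access log line in the same shape.
sample = ('2024-03-29T10:15:32.123456Z my-elb 192.168.1.10:54321 '
          '10.0.0.5:80 0.000086 0.001048 0.001337 200 200 0 57 '
          '"GET https://example.com:443/index.html HTTP/1.1"')

m = re.match(ELB_LINE, sample)
print(m.group('timestamp'))        # the field the date filter then parses
print(m.group('elb_status_code'))
```

If a line fails this shape, grok tags the event with `_grokparsefailure`, so events that silently stop matching are one thing worth checking in the output.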

I observed a behavior where, while processing the 29th of March data, it created the index and the size increased gradually; at some point it stopped ingesting more data, and I noticed a considerable reduction in the size of the last index it created. For example: it creates the 29th March 2024 index and its size increases up to 2GB, then it suddenly stops ingesting data and the size drops to 1.8GB. Now I've increased the instance type to 8xlarge, which has nearly 32GB, and allocated 10GB for the JVM as well.

@Rios @leandrojmp Please provide some assistance; I have been stuck with this issue for 2 days and couldn't find a workaround to fix it.

The main issue here is not the resources you have for Logstash; scaling it vertically will not solve anything. The main issue is that Logstash's performance on large S3 buckets is pretty bad, to the point that you can't use it, and there have been open issues about this for years.

Buckets that receive logs from AWS services, like buckets for ALB logs, Cloudtrail logs, VPC logs, etc., can have millions of objects, with thousands being created each minute. The Logstash s3 input is built so that it lists the bucket every time to get the list of objects it needs to download, and as mentioned, the listing performance on large buckets is pretty bad.

The main issues tracking this on GitHub are these:

In any case, listing buckets like that to download the files is a bad approach with or without Logstash, as it cannot scale horizontally. The recommended way is to configure S3 notifications to an SQS queue and use a more modern tool like Filebeat to monitor the queue and download the files; this can be scaled horizontally. But this still does not solve the issue of historical data.

For historical data you should download the files using another tool, maybe a Python script using the boto3 library or even the AWS CLI; it doesn't matter, you just need to download the historical data outside Logstash.
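A minimal sketch of that download approach with boto3, assuming placeholder bucket and account names (the `month_prefixes` helper and all names here are hypothetical, not part of any AWS API):

```python
import os
from datetime import date


def month_prefixes(start: date, end: date, base: str) -> list[str]:
    """Build one S3 prefix per month from start to end (inclusive),
    following the AWSLogs/<account>/elasticloadbalancing/<region> layout."""
    prefixes = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        prefixes.append(f"{base}/{y}/{m:02d}/")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return prefixes


def download_prefix(bucket: str, prefix: str, dest_dir: str) -> None:
    """Download every object under one prefix, paginating so listings
    with more than 1000 keys are fully covered."""
    import boto3  # assumed installed: pip install boto3
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            local = os.path.join(dest_dir, obj["Key"].replace("/", "_"))
            s3.download_file(bucket, obj["Key"], local)


# Example usage (placeholder bucket/account names):
# base = "AWSLogs/123456789012/elasticloadbalancing/eu-west-1"
# for p in month_prefixes(date(2023, 6, 1), date(2024, 4, 1), base):
#     download_prefix("my-alb-logs-bucket", p, "/data/alb-logs")
```

Working one month-prefix at a time keeps each listing small, which is exactly what the all-at-once listing in the s3 input does not do.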

With the historical data on disk, you can then use Logstash to consume the files using the file input.

For current and real-time data you should use S3 notifications to SQS queues and use Filebeat or Elastic Agent to consume the logs, as Logstash does not support this either.

I had a similar case a couple of years ago when I needed to consume data from buckets with Cloudtrail logs and could not get it working with Logstash. After a couple of days trying, I gave up and built a data pipeline where Logstash would only consume from Kafka; getting the data from the S3 buckets and putting it into Kafka topics was done by other tools.


@leandrojmp Thanks for the insights. I got an idea from my lead: what if we use multiple s3 input blocks, one for each month, with separate sincedb paths to get the historical data?

input {
  s3 {
    bucket => "-privalb-accesslogs"
    region => "eu-west-1"
    type => "private-elblogs"
    additional_settings => {
      force_path_style => true
      follow_redirects => false
    }
    sincedb_path => "/volume/sincedb/private-2023-06"
    prefix => "AWSLogs/xxxxxxx/elasticloadbalancing/eu-west-1/2023/06/"
  }
  s3 {
    bucket => "-privalb-accesslogs"
    region => "eu-west-1"
    type => "private-elblogs"
    additional_settings => {
      force_path_style => true
      follow_redirects => false
    }
    sincedb_path => "/volume/sincedb/private-2023-07"
    prefix => "AWSLogs/xxxxxxx/elasticloadbalancing/eu-west-1/2023/07/"
  }
}

Will this help us?

One more thing: somehow I managed to get a certain amount of logs from S3 to Elasticsearch through Logstash, but after that completed it should keep fetching the new ALB logs from S3 to ES regularly, and it is stopping after that as well. My doubt is this: it's acceptable that it fails when reading large historical log sets because of performance, but why is it failing when reading live changes? It's just reading from S3 and ingesting into ES; this should not fail, right?

Hello,

Not sure what I could add beyond what was already mentioned: using the S3 input to get logs from large buckets does not work well in Logstash, so you should look at the other options I mentioned in my previous answer.

Using prefixes can help, but if you use multiple inputs with multiple prefixes at the same time, it is basically the same as not using them. If you want to use prefixes to get historical data, you should do it in steps: for example, first get all logs from 2023-06, and after that finishes, change your input prefix to 2023-07, and so on.

As mentioned, it would be far faster to get those logs outside Logstash and configure Logstash to read the already downloaded files.

I'm not sure what you mean here. AWS buckets for logs like ALB, Cloudtrail, and VPC can have a huge number of objects, and Logstash does not perform well in this case. It doesn't matter whether it is historical data or not; the problem is mostly related to the number of files, not their size.