Errors with Logstash bulk load using CSV


(Chad Burkins) #1

I'm using Logstash to bulk load data into Elasticsearch using a CSV template.

I'm getting lots of warnings like:

retrying failed action with response code: 429 {:level=>:warn}

And lots of errors like:

too many attempts at sending event. dropping: 2015-06-16T01:06:22.000Z ip-172-31-30-197 93522,itsmxjunbu01,0,Backup,chile.eesus.jnj.com,"Jun 15, 2015  {:level=>:error} 16, 2015 1:06:22 AM",03:05:28,32,0,3

My command to load the data looks like this:

cat data.csv | /opt/logstash/bin/logstash -f ./csvload.conf

The csvload.conf looks like this:

# read input from stdin (e.g. pipe)
input {
    stdin {}
}

filter {
    # filter the input by csv (i.e. comma-separated-value)
    csv {
        columns => [
            "JobID",
            "ServerName",
            "StatusCode",
            "JobType",
            "ClientName",
            "StartTime",
            "EndTime",
            "Duration",
            "Volume-KB",
            "NumberofFiles",
            "Throughput-KB-sec"
        ]
    }

    date {
        # parse the "End Time" to create a real date
        # Examples of times in this log file
        # "May 29, 2015 10:00:01 PM"
        # "May 9, 2015 4:47:23 AM"
        # "May 23, 2015 12:23:49 PM"
        match => [ "EndTime",
                   "MMM dd, YYYY hh:mm:ss aa",
                   "MMM  d, YYYY hh:mm:ss aa" ] }

    mutate { replace => { "type" => "nbu_job" } }
    mutate { gsub => ["NumberofFiles", ",", ""] }
    mutate { convert => [ "NumberofFiles", "integer" ] }
    mutate { gsub => ["Volume-KB", ",", ""] }
    mutate { convert => [ "Volume-KB", "integer" ] }
    mutate { gsub => ["Throughput-KB-sec", ",", ""] }
    mutate { convert => [ "Throughput-KB-sec", "integer" ] }

    # Example of Duration = "04:28:13" which is hours, minutes, and seconds
    # Split up and create the respective integer fields
    grok {
        match => [ "Duration", "%{NUMBER:hours:int}:%{NUMBER:minutes:int}:%{NUMBER:seconds:int}" ]
    }

    # Call ruby to perform the basic arithmetic of computing total seconds
    ruby {
        code => "event['Elapsed'] = event['hours']*3600 + event['minutes']*60 + event['seconds']"
    }


    translate {
        field => "ServerName"
        destination => "Country"
        dictionary_path => "./cmdb/ServerByCountry.yaml"
        fallback => "Unknown"
    }
}

# send the output to stdout, using the rubydebug codec
# rubydebug uses the Ruby Awesome Print library
output {
#    stdout { codec => rubydebug }
    elasticsearch { host => localhost   }
}

I honestly don't care about the CSV import performance; I just wish I could figure out a way to slow down the load. It seems like Logstash and/or Elasticsearch is choking.

-Chad
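
One way to slow the load without touching the Logstash config is to rate-limit the lines before they reach stdin. The following is only a minimal sketch of that idea, not a Logstash feature: the script name throttle.py, the throttle() function, and the rate of 500 lines per second are all hypothetical choices for illustration.

```python
import sys
import time


def throttle(lines, out, max_per_second, sleep=time.sleep, clock=time.monotonic):
    """Copy lines to `out`, emitting at most `max_per_second` lines per second.

    `sleep` and `clock` are injectable so the pacing can be checked without
    real waiting. Returns the number of lines written.
    """
    interval = 1.0 / max_per_second
    next_allowed = clock()
    count = 0
    for line in lines:
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)
        out.write(line)
        out.flush()  # hand each line to the downstream pipe immediately
        # Schedule the next line; if we fell behind, restart from "now".
        next_allowed = max(next_allowed, clock()) + interval
        count += 1
    return count


# Runs only when a rate is given on the command line (hypothetical usage):
#   cat data.csv | python throttle.py 500 | /opt/logstash/bin/logstash -f ./csvload.conf
if __name__ == "__main__" and len(sys.argv) > 1:
    throttle(sys.stdin, sys.stdout, float(sys.argv[1]))
```

A suitable rate is something to find by experiment: lower it until the 429 warnings stop. This only paces the input; it does not change how Logstash batches events to Elasticsearch.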


(Mark Walkom) #2

How big is the file?
What are the specs of the ES host?
Have you checked things like threadpool rejections?
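
For context: a 429 response from Elasticsearch generally means a request was rejected because a thread pool queue (for bulk loads, the bulk pool) was full. The rejection counters are exposed by the _cat/thread_pool API. The sketch below shows one way to read them; the helper names are hypothetical, it assumes Elasticsearch is reachable at http://localhost:9200, and the exact column names in the output vary between Elasticsearch versions, so it simply sums every column whose name ends in "rejected".

```python
import urllib.request


def parse_cat_table(text):
    """Parse whitespace-separated _cat output (requested with ?v) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def rejected_counts(rows):
    """Sum, across all rows, every numeric column whose name ends in 'rejected'."""
    totals = {}
    for row in rows:
        for name, value in row.items():
            if name.endswith("rejected") and value.isdigit():
                totals[name] = totals.get(name, 0) + int(value)
    return totals


def fetch_thread_pool(base_url="http://localhost:9200"):
    """Fetch the raw _cat/thread_pool table (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/_cat/thread_pool?v") as resp:
        return resp.read().decode("utf-8")


# Hypothetical usage against a running node:
#   print(rejected_counts(parse_cat_table(fetch_thread_pool())))
```

A rejected count that keeps climbing during the load would suggest Elasticsearch cannot keep up with the rate at which Logstash is sending bulk requests.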


(Chad Burkins) #3

Thanks for the reply, Mark.

I've got five different input files, all with similar data, ranging in size from 600K lines to 1.7M lines. Strangely, the input file with 1.7M lines ran just fine, but the "smaller" file with 600K lines produces the warnings and errors.

The ES host is an AWS t2.medium (2 vCPUs and 4 GB of RAM).

I'm still new to this (but enjoying it very much), so I'm not sure what you mean by threadpool rejections. I'm guessing "threadpool" is a resource defined in elasticsearch.yml?

-Chad


(Christian Dahlqvist) #4

t2 instances are burstable-performance instances and may not be ideal for long, sustained bulk loading. To avoid this limitation, you could perform the bulk load locally or on a separate, more powerful instance, and then use snapshot and restore, possibly via S3 using the AWS plugin, to move the indexed data to your instance.
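
As a rough illustration of the snapshot-and-restore route, the sketch below builds the three REST calls involved: registering a repository, taking a snapshot, and restoring it. The repository name, snapshot name, and bucket name are placeholders, the helper functions are hypothetical, and the "s3" repository type assumes the AWS plugin is installed on both clusters.

```python
import json
import urllib.request


def snapshot_requests(base_url, repo, snapshot, bucket):
    """Build (method, url, body) tuples for register, snapshot, and restore."""
    repo_url = "%s/_snapshot/%s" % (base_url, repo)
    snap_url = "%s/%s" % (repo_url, snapshot)
    return {
        # Register an S3-backed repository (needs the AWS plugin).
        "register": ("PUT", repo_url, {"type": "s3", "settings": {"bucket": bucket}}),
        # Take the snapshot and block until it has finished.
        "snapshot": ("PUT", snap_url + "?wait_for_completion=true", None),
        # Restore the snapshot on the target cluster.
        "restore": ("POST", snap_url + "/_restore", None),
    }


def send(method, url, body=None):
    """Send one request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage: run "register" and "snapshot" on the loading host,
# then "register" and "restore" on the t2 instance.
#   calls = snapshot_requests("http://localhost:9200", "my_repo", "load_1", "my-bucket")
#   send(*calls["register"]); send(*calls["snapshot"])
```

The restore step assumes the target cluster does not already have open indices with the same names as those in the snapshot.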

