How to capture text from a path or S3 prefix

I have a slightly different requirement where I have a directory structure something like this:

/data/aws-test1
/data/aws-test2 ... and so on

Is there a way, when using the file input, to capture aws-test1 and aws-test2 and use that name as the index name when ingesting files from those directories?

I am using something like this at the moment:

input {
  file {
    path => "/data/**/*.gz"
    codec => "cloudtrail"
    start_position => "beginning"
    sincedb_path => "/usr/share/logstash/.sincedb"
    type => "cloudtrail"
    max_open_files => "1024"
  }
}

filter {
  grok {
    match => { "path" => "(/data/?<tstmp>\S+)/.*" }
  }
}

output {
  stdout { codec => rubydebug }

  elasticsearch {
    hosts => ["xx.xx.xx.xx:9200"]
    index => "%{[tstmp]}-%{+YYYY-MM}"
  }
}

Is this even possible to do?

--
Niraj

Change

"(/data/?<tstmp>\S+)/.*"

to

"^/data/(?<tstmp>[^/]+)/"

but otherwise it should be fine.
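
For reference, the original pattern opens the group before /data/, so ?<tstmp> is not parsed as a named capture at all. With the corrected pattern, the filter block from the config above would look like this:

filter {
  grok {
    match => { "path" => "^/data/(?<tstmp>[^/]+)/" }
  }
}

Here [^/]+ matches everything up to the next slash, so for a path like /data/aws-test1/foo.gz the tstmp field becomes aws-test1.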

Hi @magnusbaeck,

This is what I get when the indices are created:

yellow open %{[tstmp]}-2017-07 yAWsiuiLQzGGADBNJWP3hw 5 1  7 0 128.5kb 128.5kb
yellow open %{[tstmp]}-2017-08 5XJUW7ejTgmeQSkdFll_Xg 5 1 23 0 254.7kb 254.7kb

--
Niraj

@magnusbaeck Seems like there was a typo in my copy-paste. It works now.

@magnusbaeck
One more problem I have now is ingesting the *.gz files from AWS CloudTrail. These gzip files contain JSON that does not end with a newline, so Logstash keeps expecting more input because it never sees the end of the file.

logstash.inputs.file - each: file grew is what I see in the logs, and it repeats forever.

Is there a way to overcome this?

The file input expects log entries to end with a newline character. I don't think there's a workaround for this.

Thanks @magnusbaeck for the feedback. What I have now is a script that pre-processes the S3 files: it decompresses them, appends a newline to every file, and leaves the original *.gz files in place so I can revert to them if required. Ingestion is working fine now, with a few fields showing up in Kibana as not recognized, but I believe those are due to mapping issues.

@magnusbaeck I think I am stuck again. I am flipping back and forth between the file input and the s3 input to compare performance and the hassle of processing the data. Here is my config with the pattern I want to grab for index creation, but somehow it is not extracting the value I want.

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    backup_to_dir => "/etc/s3backup/"
    add_field => { "source" => "gzfiles" }
    codec => cloudtrail {}
    region => "us-east-1"
    access_key_id => "xxxxxxxxxxxxxxxxxxxx"
    secret_access_key => "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    sincedb_path => "/etc/s3backup/sincedb"
  }
}

filter {
  grok {
    match => { "prefix" => "^AWSLogs/(?<tstmp>[^/]+)/" }
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    index => "%{[tstmp]}-%{+YYYY-MM}"
    hosts => ["xxxxxxxxxxxxx:9200"]
  }
}

I want the name after AWSLogs/, which is generally the AWS account name, to be used as my index name, but the indices that get created have

%{[tstmp]}-2017-02 as the index name. What am I doing wrong here?

--
Niraj

Is there a prefix field? Show an example event produced by Logstash.

@magnusbaeck I was under the assumption that I could extract the value from the "prefix" option used in the s3 input. Now I realize it has to be an actual field on the event. If that is the issue, is there a way I can grab the value I pass to the "prefix" option in the input? The reason I want this is that there are no other fields in the CloudTrail events that can give me this value.

Let me know your view or any suggestions.

Attached is a sample CloudTrail event:

{
               "eventID" => "1fc18942-bf47-41e2-b20d-b09fbcaa7acb",
             "awsRegion" => "us-east-1",
          "eventVersion" => "1.05",
      "responseElements" => {
        "assumedRoleUser" => {
            "assumedRoleId" => "xxxxxxxxxxxxxxx",
                      "arn" => "xxxxxxxxxxxxxxxxxxxxxxxxxx"
        },
            "credentials" => {
             "accessKeyId" => "xxxxxxxxxxxxxxxxxxx",
            "sessionToken" => "xxxxxxxxxxxxxxxxxxxx",
              "expiration" => "Feb 21, 2017 1:00:02 AM"
        }
    },
       "sourceIPAddress" => "xxxxxxxxxxxxxxxx",
           "eventSource" => "sts.amazonaws.com",
     "requestParameters" => {
                "roleArn" => "xxxxxxxxxxxxxxxxxx",
        "roleSessionName" => "xxxxxxxxxxxxxxx"
    },
             "resources" => [
        [0] {
            "accountId" => "xxxxxxxxxxxxx",
                 "type" => "AWS::IAM::Role",
                  "ARN" => "xxxxxxxxxxxxxxxxxxxx"
        }
    ],
             "userAgent" => "Boto/2.43.0 Python/2.7.3 Linux/3.2.0-119-generic",
          "userIdentity" => {
        "accessKeyId" => "xxxxxxxxxxxxxxxxx",
          "accountId" => "xxxxxxxxxxxxx",
        "principalId" => "xxxxxxxxxxxxxxxxx",
               "type" => "IAMUser",
                "arn" => "xxxxxxxxxxxxxxx",
           "userName" => "xxxxxxxxxxxxxx"
    },
             "eventType" => "AwsApiCall",
                "source" => "gzfiles",
                  "type" => "cloudtrail",
                  "tags" => [
        [0] "_grokparsefailure"
    ],
            "@timestamp" => 2017-02-21T00:00:02.000Z,
         "sharedEventID" => "2a9b9a58-6d9b-4d66-873c-ae3e08a6f4f6",
             "requestID" => "b1c387b7-f7c8-11e6-a1b7-a52c9bd268c6",
              "@version" => "1",
             "eventName" => "AssumeRole",
    "recipientAccountId" => "xxxxxxxxxxxxx"
}

If that is the issue, is there a way I can grab the value I pass to the "prefix" option in the input?

No, but you can use add_field in the input to add whatever field you like.

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    ...
    add_field => {
      "prefix" => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    }
  }
}
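
With that field present on the event, the grok filter above that matches on "prefix" should then be able to extract tstmp from it, since the field contains the same string you configured as the prefix.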

@magnusbaeck My bad for not explaining this correctly. Apologies.

This was just for testing, so I included the complete path to narrow things down while checking ingestion speed.

Basically I have different AWS accounts under the AWSLogs directory.

Something like below:

AWSLogs/aws-test/
AWSLogs/aws-dev/

and so on, and each one of them should be treated as a separate index since they belong to different accounts. Will this trick still work?

--
Niraj
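
For what it's worth, one way the add_field trick could still work with multiple accounts is to declare one s3 input per account prefix, each adding its own "prefix" field. This is only a sketch (untested, with the bucket, credentials, and other options elided as in the example above); the account names are taken from the directory listing earlier in the thread:

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/aws-test/"
    add_field => { "prefix" => "AWSLogs/aws-test/" }
    ...
  }
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/aws-dev/"
    add_field => { "prefix" => "AWSLogs/aws-dev/" }
    ...
  }
}

The same grok filter on "prefix" would then yield tstmp values of aws-test and aws-dev, and the elasticsearch output would create one index per account. Each input would presumably also need its own sincedb_path so the inputs do not step on each other's state.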

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.