How to capture text from a path or S3 prefix

I have a slightly different requirement where I have a directory structure something like this:

/data/aws-test1
/data/aws-test2 ... and so on

Is there a way, when using the file input, to capture aws-test1 and aws-test2 and use that name as the index name when ingesting files from those directories?

I am using something like this at the moment:

input {
  file {
    path => "/data/**/*.gz"
    codec => "cloudtrail"
    start_position => "beginning"
    sincedb_path => "/usr/share/logstash/.sincedb"
    type => "cloudtrail"
    max_open_files => "1024"
  }
}

filter {
  grok {
    match => { "path" => "(/data/?<tstmp>\S+)/.*" }
  }
}

output {
  stdout { codec => rubydebug }

  elasticsearch {
    hosts => ["xx.xx.xx.xx:9200"]
    index => "%{[tstmp]}-%{+YYYY-MM}"
  }
}

Is this even possible to do?

--
Niraj

Change

"(/data/?<tstmp>\S+)/.*"

to

"^/data/(?<tstmp>[^/]+)/"

but otherwise it should be fine.
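
For reference, the original pattern opens the group before /data/, so ?<tstmp> is not parsed as a named capture at all. With the corrected pattern, the filter block from the config above would look like this:

filter {
  grok {
    match => { "path" => "^/data/(?<tstmp>[^/]+)/" }
  }
}

Here [^/]+ matches everything up to the next slash, so for a path like /data/aws-test1/foo.gz the tstmp field becomes aws-test1.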

Hi @magnusbaeck,

This is what I get when the indices are created:

yellow open %{[tstmp]}-2017-07 yAWsiuiLQzGGADBNJWP3hw 5 1  7 0 128.5kb 128.5kb
yellow open %{[tstmp]}-2017-08 5XJUW7ejTgmeQSkdFll_Xg 5 1 23 0 254.7kb 254.7kb

--
Niraj

@magnusbaeck Seems like there was a typo in my copy-paste. It works now.

@magnusbaeck
One more problem I have now is ingesting the *.gz files from AWS CloudTrail. These gzip files contain JSON that does not end with a newline, so Logstash keeps expecting more input because it never sees the end of the file.

logstash.inputs.file - each: file grew is what I see in the logs, and it repeats forever.

Is there a way to overcome this?

The file input expects log entries to end with a newline character. I don't think there's a workaround for this.

Thanks @magnusbaeck for the feedback. What I have now is a script that pre-processes the S3 files: it decompresses them, appends a newline to every file, and leaves the original *.gz files in place so I can revert to them if required. Ingestion is working fine now, with a few fields showing up in Kibana as not recognized, but I believe those are due to mapping issues.

@magnusbaeck I think I am stuck again. I am flipping back and forth between the file input and the s3 input to compare performance and the hassle of processing the data. Here is my config with the pattern I want to grab for index creation, but somehow it is not extracting the value I want.

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    backup_to_dir => "/etc/s3backup/"
    add_field => { "source" => "gzfiles" }
    codec => cloudtrail {}
    region => "us-east-1"
    access_key_id => "xxxxxxxxxxxxxxxxxxxx"
    secret_access_key => "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    sincedb_path => "/etc/s3backup/sincedb"
  }
}

filter {
  grok {
    match => { "prefix" => "^AWSLogs/(?<tstmp>[^/]+)/" }
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    index => "%{[tstmp]}-%{+YYYY-MM}"
    hosts => ["xxxxxxxxxxxxx:9200"]
  }
}

I want the name after AWSLogs/, which is generally the AWS account name, to be used as my index name, but the indices that get created have

%{[tstmp]}-2017-02 as the index name. What am I doing wrong here?

--
Niraj

Is there a prefix field? Show an example event produced by Logstash.

@magnusbaeck I was under the assumption that I could extract the value from the "prefix" option used in the s3 input. Now I realize it has to be an actual field on the event. If that is the issue, is there a way I can grab the value I pass to the "prefix" option in the input? The reason I want this is that there are no other fields in the CloudTrail events that can give me this value.

Let me know your view or any suggestions.

Attached is a sample CloudTrail event:

{
               "eventID" => "1fc18942-bf47-41e2-b20d-b09fbcaa7acb",
             "awsRegion" => "us-east-1",
          "eventVersion" => "1.05",
      "responseElements" => {
        "assumedRoleUser" => {
            "assumedRoleId" => "xxxxxxxxxxxxxxx",
                      "arn" => "xxxxxxxxxxxxxxxxxxxxxxxxxx"
        },
            "credentials" => {
             "accessKeyId" => "xxxxxxxxxxxxxxxxxxx",
            "sessionToken" => "xxxxxxxxxxxxxxxxxxxx",
              "expiration" => "Feb 21, 2017 1:00:02 AM"
        }
    },
       "sourceIPAddress" => "xxxxxxxxxxxxxxxx",
           "eventSource" => "sts.amazonaws.com",
     "requestParameters" => {
                "roleArn" => "xxxxxxxxxxxxxxxxxx",
        "roleSessionName" => "xxxxxxxxxxxxxxx"
    },
             "resources" => [
        [0] {
            "accountId" => "xxxxxxxxxxxxx",
                 "type" => "AWS::IAM::Role",
                  "ARN" => "xxxxxxxxxxxxxxxxxxxx"
        }
    ],
             "userAgent" => "Boto/2.43.0 Python/2.7.3 Linux/3.2.0-119-generic",
          "userIdentity" => {
        "accessKeyId" => "xxxxxxxxxxxxxxxxx",
          "accountId" => "xxxxxxxxxxxxx",
        "principalId" => "xxxxxxxxxxxxxxxxx",
               "type" => "IAMUser",
                "arn" => "xxxxxxxxxxxxxxx",
           "userName" => "xxxxxxxxxxxxxx"
    },
             "eventType" => "AwsApiCall",
                "source" => "gzfiles",
                  "type" => "cloudtrail",
                  "tags" => [
        [0] "_grokparsefailure"
    ],
            "@timestamp" => 2017-02-21T00:00:02.000Z,
         "sharedEventID" => "2a9b9a58-6d9b-4d66-873c-ae3e08a6f4f6",
             "requestID" => "b1c387b7-f7c8-11e6-a1b7-a52c9bd268c6",
              "@version" => "1",
             "eventName" => "AssumeRole",
    "recipientAccountId" => "xxxxxxxxxxxxx"
}

If that is the issue, is there a way I can grab the value I pass to the "prefix" option in the input?

No, but you can use add_field in the input to add whatever field you like.

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    ...
    add_field => {
      "prefix" => "AWSLogs/xxxxxxxxxxxxx/CloudTrail/us-east-1/2017/02/21/"
    }
  }
}
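
With that field present on the event, the grok filter above that matches on "prefix" should then be able to extract tstmp from it, since the field contains the same string you configured as the prefix.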

@magnusbaeck My bad for not explaining this correctly. Apologies.

This was just for testing, so I included the complete path to narrow things down while checking ingestion speed.

Basically I have different AWS accounts under the AWSLogs directory.

Something like below:

AWSLogs/aws-test/
AWSLogs/aws-dev/

and so on, and each one of them should be treated as a separate index since they belong to different accounts. Will this trick still work?

--
Niraj
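
For what it's worth, one way the add_field trick could still work with multiple accounts is to declare one s3 input per account prefix, each adding its own "prefix" field. This is only a sketch (untested, with the bucket, credentials, and other options elided as in the example above); the account names are taken from the directory listing earlier in the thread:

input {
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/aws-test/"
    add_field => { "prefix" => "AWSLogs/aws-test/" }
    ...
  }
  s3 {
    type => "cloudtrail"
    bucket => "xxxxxxxxxxxxxxxx"
    prefix => "AWSLogs/aws-dev/"
    add_field => { "prefix" => "AWSLogs/aws-dev/" }
    ...
  }
}

The same grok filter on "prefix" would then yield tstmp values of aws-test and aws-dev, and the elasticsearch output would create one index per account. Each input would presumably also need its own sincedb_path so the inputs do not step on each other's state.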

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.