Importing tweets CSV date parsing


(Gid) #1

Hi all,

Trying to import tweet data from Twitter's recently released archives and running into issues with date fields. The goal is to convert the tweet_time field from the CSV as @timestamp to be used as time filter field. The field gets loaded as text/string instead of date. The account_creation_date is recognised correctly.

Data source:
https://about.twitter.com/en_us/values/elections-integrity.html#data
("Tweet information" link under Internet Research Agency)

Sample values & format from tweet_date column:
2014-07-24 19:20
2014-11-22 15:28
2017-03-13 22:08

For reference, account_creation_date field is formatted as follows (and is recognised as date, even without me explicitly defining its format):
2012-02-25
2014-04-29
2013-12-07

logstash config:

input {
      file {
          path => "/Users/gid/Downloads/test.csv" # 24-row test sample
          start_position => "beginning"
          sincedb_path => "/dev/null"
      }
}

filter {
    csv {
        separator => ","

        columns => ['tweetid', 'userid', 'user_display_name', 'user_screen_name',
       'user_reported_location', 'user_profile_description',
       'user_profile_url', 'follower_count', 'following_count',
       'account_creation_date', 'account_language', 'tweet_language',
       'tweet_text', 'tweet_time', 'tweet_client_name', 'in_reply_to_tweetid',
       'in_reply_to_userid', 'quoted_tweet_tweetid', 'is_retweet',
       'retweet_userid', 'retweet_tweetid', 'latitude', 'longitude',
       'quote_count', 'reply_count', 'like_count', 'retweet_count', 'hashtags',
       'urls', 'user_mentions', 'poll_choices']

        id => "tweetid"

        convert => {
            "follower_count" => "integer"
            "following_count" => "integer"
            "is_retweet" => "boolean"
            "quote_count" => "integer"
            "reply_count" => "integer"
            "like_count" => "integer"
            "retweet_count" => "integer"
            #"account_creation_date" => "date"
            "tweet_time" => "date_time"
        }
    }

    date {
        match => ["tweet_time", "yyyy/MM/dd HH:mm"]
        target => "tweet_time"
    }
}

output {
    elasticsearch {
        hosts => "localhost"
        index => "iratweetsv1"
        document_type => "tweet"
    }
}

How do I configure this correctly? I've tried many different setups but have not been able to parse these dates correctly. According to the mapping, this field is text:

"tweet_time": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }

As opposed to account_creation_date:

"account_creation_date": {
          "type": "date"
        }

EDIT: Running ES, Kibana and logstash version 6.5.3


(Bryan Stuhlsatz) #2

Where did u get the mapping? Did you create the mapping for your index which specifies tweet_time to be a text field?


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.