Hi all,
Trying to import tweet data from Twitter's recently released archives and running into issues with date fields. The goal is to convert the tweet_time field from the CSV as @timestamp to be used as time filter field. The field gets loaded as text/string instead of date. The account_creation_date is recognised correctly.
Data source:
https://about.twitter.com/en_us/values/elections-integrity.html#data
("Tweet information" link under Internet Research Agency)
Sample values & format from tweet_date column:
2014-07-24 19:20
2014-11-22 15:28
2017-03-13 22:08
For reference, account_creation_date field is formatted as follows (and is recognised as date, even without me explicitly defining its format):
2012-02-25
2014-04-29
2013-12-07
logstash config:
input {
file {
path => "/Users/gid/Downloads/test.csv" # 24-row test sample
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
columns => ['tweetid', 'userid', 'user_display_name', 'user_screen_name',
'user_reported_location', 'user_profile_description',
'user_profile_url', 'follower_count', 'following_count',
'account_creation_date', 'account_language', 'tweet_language',
'tweet_text', 'tweet_time', 'tweet_client_name', 'in_reply_to_tweetid',
'in_reply_to_userid', 'quoted_tweet_tweetid', 'is_retweet',
'retweet_userid', 'retweet_tweetid', 'latitude', 'longitude',
'quote_count', 'reply_count', 'like_count', 'retweet_count', 'hashtags',
'urls', 'user_mentions', 'poll_choices']
id => "tweetid"
convert => {
"follower_count" => "integer"
"following_count" => "integer"
"is_retweet" => "boolean"
"quote_count" => "integer"
"reply_count" => "integer"
"like_count" => "integer"
"retweet_count" => "integer"
#"account_creation_date" => "date"
"tweet_time" => "date_time"
}
}
date {
match => ["tweet_time", "yyyy/MM/dd HH:mm"]
target => "tweet_time"
}
}
output {
elasticsearch {
hosts => "localhost"
index => "iratweetsv1"
document_type => "tweet"
}
}
How do I configure this correctly? I've tried many different setups but have not been able to parse these dates correctly. According to the mapping, this field is text:
"tweet_time": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
As opposed to account_creation_date:
"account_creation_date": {
"type": "date"
}
EDIT: Running ES, Kibana and logstash version 6.5.3