Selecting values from JSON

Hello.
I am parsing Twitter with the following settings:

input {
  twitter {
    consumer_key => "XX"
    consumer_secret => "XX"
    oauth_token => "XX"
    oauth_token_secret => "XX"
    full_tweet => true
    use_samples => true
    languages => ["en", "de"]
  }
}

output {
  elasticsearch {
    hosts => ["10.0.20.51:9200"]
    index => "tweets-%{+YYYY.MM.dd}"
  }
}

I do not need the massive JSON with more than 900 fields being stored in my ES.

For example:
{
  "_index": "tweets-2018.07.24",
  "_type": "doc",
  "_id": "wE6hzGQB2mGdQWLhJvXj",
  "_version": 1,
  "_score": null,
  "_source": {
    "entities": {
      "hashtags": [],
      "urls": [
        {
          "expanded_url": "https://twitter.com/i/web/status/1021759356671598592",
          "display_url": "twitter.com/i/web/status/1…",
          "url": "https://t.co/TVqHFnvmUG",
          "indices": [
            117,
            140
          ]
        }
      ],
      "user_mentions": [],
      "symbols": []
    },
    "text": "En todo lo que va del año hasta ahora entraba a gim 10.20 pensando que era ese el horario (y encima llegaba tarde),… https://t.co/TVqHFnvmUG",
    "in_reply_to_user_id_str": null,
    "extended_tweet": {
      "full_text": "En todo lo que va del año hasta ahora entraba a gim 10.20 pensando que era ese el horario (y encima llegaba tarde), hoy me enteré que entrábamos a las 11🤦",
      "display_text_range": [
        0,
        154
      ],
      "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [],
        "symbols": []
      }
    },
    "quote_count": 0,
    "geo": null,
    "timestamp_ms": "1532441388658",
    "@timestamp": "2018-07-24T14:09:48.000Z",
    "favorited": false,
    "reply_count": 0,
    "truncated": true,
    "contributors": null,
    "in_reply_to_status_id_str": null,
    "place": null,
    "lang": "es",
    "is_quote_status": false,
    "@version": "1",
    "retweet_count": 0,
    "favorite_count": 0,
    "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
    "filter_level": "low"
  }
}

How can I extract only the following fields:
"@timestamp"
"lang"
etc.

using a filter?

filter {
  json {
    source => "@timestamp"
  }
}

It is all quite confusing to me.
If anyone could point me to the right place, or show how to keep only the given fields, that would be amazing.
Thanks!

I would use mutate+remove_field to remove the unwanted fields.
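A minimal sketch (the field names are just examples taken from the tweet above; list whatever you want to drop):

filter {
  mutate {
    # [outer][inner] field references also reach nested fields.
    remove_field => ["contributors", "geo", "[entities][urls]"]
  }
}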

Have a look at the prune filter.

Hello, @magnusbaeck, thanks for pointing out the prune filter.

First, I installed it:
./logstash-plugin install logstash-filter-prune
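
To confirm it is available (assuming the same bin directory as the install command), the plugin should appear in:

./logstash-plugin list | grep prune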

I have managed to prototype the filter:

filter {
  prune {
    whitelist_names => [
      "entities.hashtags.text",
      "entities.user_mentions.name",
      "entities.user_mentions.id",
      "lang",
      "coordinates",
      "retweeted_status.entities.hashtags.text",
      "retweeted_status.entities.user_mentions.name",
      "retweeted_status.entities.user_mentions.id",
      "text",
      "extended_tweet.full_text",
      "extended_tweet.entities.hashtags.text",
      "extended_tweet.entities.urls.url",
      "extended_tweet.entities.urls.expanded_url",
      "@timestamp"
    ]
  }
}

The JSON that I am parsing is:
https://pastebin.com/n69pHb7H

However, I am only matching top-level (non-nested) fields; the dotted names above do not match anything, because prune only operates on top-level field names:

Kibana output (excerpt):
{
  "lang": "en",
  "coordinates": null,
  "text": "RT @liamosaur: I just heard mansplaining referred to as \"correctile dysfunction\" and I'm pretty shook :rofl::rofl::rofl:",
  "@timestamp": "2018-07-26T08:03:23.000Z"
},
"fields": {
  "@timestamp": [
    "2018-07-26T08:03:23.000Z"
  ]
},
"sort": [
  1532592203000
]

I am working on this in the background.
If anyone has an idea how to access those nested objects, please comment.

At the moment I am doing the following:
filter {
  prune {
    whitelist_names => [
      "^entities$",
      "^lang$",
      "^coordinates$",
      "^retweeted_status$",
      "^text$",
      "^extended_tweet$",
      "^@timestamp$",
      "^user$"
    ]
  }
}
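
Since prune only works on top-level fields, a possible refinement (an untested sketch; the subfield names are just examples) would be to keep the whole top-level objects as above and then strip the unwanted subfields with mutate's nested field references:

filter {
  prune {
    whitelist_names => ["^entities$", "^lang$", "^text$", "^extended_tweet$", "^@timestamp$"]
  }
  mutate {
    # [outer][inner] references can remove nested subfields,
    # which prune's whitelist cannot express.
    remove_field => ["[entities][urls]", "[extended_tweet][display_text_range]"]
  }
}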

The reason for figuring out such precise filtering is to limit the daily data load into the ES cluster to the absolute minimum.
