Preprocess tweets before indexing

I want to index tweets using Logstash and Elasticsearch, with the twitter input plugin. The tweets that come in contain a lot of information that is useless for my application. I would like to keep only some fields, rename or flatten others, and possibly split one document into two distinct documents. For instance, let's say this is the incoming document:

{
    "tweet": {
       "tweetId": 1025,
       "tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
       "hashtags": ["stackOverflow", "elasticsearch"],
       "publishedAt": "2017 23 August",
       "analytics": {
           "likeNumber": 400,
           "shareNumber": 100,
       }
    },
    "author":{
       "authorId": 819744,
       "authorAt": "the_expert",
       "authorName": "John Smith",
       "description": "Haha it's a fake description"
    }
}

Now I want to generate the following two documents:

# indexed in twitter/tweet/1025; the id for this document should be taken from `tweetId` (1025)
{
    "content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
    "hashtags": ["stackOverflow", "elasticsearch"],
    "date": "2017/08/23", # the date has been formated
    "shareNumber": 100 # This field has been flattened
}

And the second document would be:

# indexed in twitter/author/819744; the id for this document should be taken from `authorId` (819744)
{
   "authorAt": "the_expert",
   "description": "Haha it's a fake description"
}

Is it possible? How can I do so?

Use the clone filter to, well, clone the original event into two events. Then use whatever filters you need to process each event, wrapping the filters in conditionals so that one set of filters applies to the original event and another set applies to the clone.
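
For illustration, here is a minimal pipeline sketch along those lines. It assumes the simplified field layout from the question (the real twitter input emits the full Twitter API payload, so the actual field paths will differ), a twitter input that is already configured, and an Elasticsearch version that still accepts index/type/id addressing via document_type; the hosts value is a placeholder for your own cluster:

filter {
    # Make a copy of every event. With the classic (non-ECS) behaviour the copy gets
    # its "type" field set to "author"; on newer Logstash with ECS compatibility the
    # clone name is added to "tags" instead, so the conditionals would check [tags].
    clone {
        clones => ["author"]
    }

    if [type] == "author" {
        # Author document: keep only the author fields and stash the id in @metadata.
        mutate {
            rename => {
                "[author][authorAt]"    => "authorAt"
                "[author][description]" => "description"
                "[author][authorId]"    => "[@metadata][doc_id]"
            }
            remove_field => ["tweet", "author"]
        }
    } else {
        # Tweet document: rename and flatten the tweet fields.
        mutate {
            rename => {
                "[tweet][tweetContent]"           => "content"
                "[tweet][hashtags]"               => "hashtags"
                "[tweet][analytics][shareNumber]" => "shareNumber"
                "[tweet][tweetId]"                => "[@metadata][doc_id]"
            }
        }
        # Parses "2017 23 August" into the "date" field (stored as an ISO8601
        # timestamp rather than literally "2017/08/23").
        date {
            match  => ["[tweet][publishedAt]", "yyyy dd MMMM"]
            target => "date"
        }
        mutate {
            remove_field => ["tweet", "author"]
        }
    }
}

output {
    if [type] == "author" {
        elasticsearch {
            hosts         => ["localhost:9200"]   # adjust to your cluster
            index         => "twitter"
            document_type => "author"
            document_id   => "%{[@metadata][doc_id]}"
        }
    } else {
        elasticsearch {
            hosts         => ["localhost:9200"]
            index         => "twitter"
            document_type => "tweet"
            document_id   => "%{[@metadata][doc_id]}"
        }
    }
}

Putting the id in [@metadata][doc_id] keeps it out of the indexed document itself, which matches the desired output above; if you also want the id stored as a regular field, rename it to a normal field instead and reference that field in document_id.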

