Indexing Twitter data since 2015 into Elasticsearch

Hi all, I have a document-oriented database (JSON files) of Twitter tweets about a bank, collected since 2015. Can you suggest a way to model this data in Elasticsearch so that my queries are easy to write and the data is structured optimally?

The question is a little bit too broad. If you have examples, it would be easier to help you.

Can you give us one example of what a tweet currently looks like?

If they are already in JSON you could just index them with Logstash into Elasticsearch. Tell us a little bit more about your source please.

You may also add your requirements here, but above all I would appreciate knowing what you have tried so far, or at what specific point you cannot make progress.

There is also an article from the Elastic Evangelist http://david.pilato.fr/blog/2015/06/01/indexing-twitter-with-logstash-and-elasticsearch/


The database I have is a set of several JSON files from Twitter, covering 2015 through 2019. I have not used Logstash; I received the raw data from my boss. Below is an example of a tweet that was crawled about the bank where I am doing my internship. Based on this example, how can I index this number of JSON files so that they are well structured in Elasticsearch? In other words, how can I store them in Elasticsearch in an optimal way that makes searching the tweets easy?

This is an example:
{
"contributors": null,
"coordinates": null,
"created_at": "Mon Mar 25 09:15:00 +0000 2019",
"entities": {
"hashtags": [
{
"indices": [
0,
9
],
"text": "FIAD2019"
}
],
"symbols": [],
"urls": [
{
"display_url": "twitter.com/i/web/status/1\u2026",
"expanded_url": "https://twitter.com/i/web/status/1110107807783755776",
"indices": [
117,
140
],
"url": "https://t.co/C0JtE7wLYc"
}
],
"user_mentions": []
},
"favorite_count": 0,
"favorited": false,
"geo": null,
"id": 1110107807783755776,
"id_str": "1110107807783755776",
"in_reply_to_screen_name": null,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"is_quote_status": false,
"lang": "fr",
"metadata": {
"iso_language_code": "fr",
"result_type": "recent"
},
"place": null,
"possibly_sensitive": false,
"retweet_count": 0,
"retweeted": false,
"source": "<a href=\"https://swello.com/fr/\" rel=\"nofollow\">Swello</a>",
"text": "#FIAD2019 / Mouna KADIRI, directrice du Club Afrique D\u00e9veloppement de AWB : \u00ab Notre philosophie est de d\u00e9ployer une\u2026 https://t.co/C0JtE7wLYc",
"truncated": true,
"user": {
"contributors_enabled": false,
"created_at": "Sat Jun 27 17:38:59 +0000 2009",
"default_profile": true,
"default_profile_image": false,
"description": "Directeur de https://t.co/N0Bu1bmlDn (AP.P) Ancien r\u00e9dac chef La Tribune Hebdo, Courrier International, Paris Diplomatie, VoxLatina\u2026",
"entities": {
"description": {
"urls": [
{
"display_url": "africapresse.paris",
"expanded_url": "http://africapresse.paris",
"indices": [
13,
36
],
"url": "https://t.co/N0Bu1bmlDn"
}
]
},
"url": {
"urls": [
{
"display_url": "africapresse.paris",
"expanded_url": "https://www.africapresse.paris/",
"indices": [
0,
23
],
"url": "https://t.co/sfT9zVs6AT"
}
]
}
},
"favourites_count": 3262,
"follow_request_sent": false,
"followers_count": 770,
"following": false,
"friends_count": 620,
"geo_enabled": false,
"has_extended_profile": false,
"id": 51500496,
"id_str": "51500496",
"is_translation_enabled": false,
"is_translator": false,
"lang": "fr",
"listed_count": 65,
"location": "Paris",
"name": "Alfred Mignot (AP.P)",
"notifications": false,
"profile_background_color": "C0DEED",
"profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_tile": false,
"profile_image_url": "http://pbs.twimg.com/profile_images/286712308/alm75_normal.jpg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/286712308/alm75_normal.jpg",
"profile_link_color": "1DA1F2",
"profile_sidebar_border_color": "C0DEED",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"protected": false,
"screen_name": "alfredmignot",
"statuses_count": 12154,
"time_zone": null,
"translator_type": "none",
"url": "https://t.co/sfT9zVs6AT",
"utc_offset": null,
"verified": false
}
}

Please format your code, logs or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

If you are not using markdown format, use the </> icon in the editor toolbar.

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Thank you so much, I updated my post.

Could you format the data properly, please? In general, I think you have to tell us which information is important to you and which information you want to keep. Elasticsearch could store the data 1:1, but it might be of little use to you in that form. A little clarification of your motivation or objectives would help us understand your problem.
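To illustrate what "keeping only the important information" could look like, here is a minimal Python sketch. The choice of fields, the function name `slim_tweet`, and the `TWEET_MAPPING` dictionary are all assumptions made for illustration, not a recommendation for this particular dataset. It converts `created_at` to ISO 8601, which the default Elasticsearch `date` type should accept, and keeps `id_str` rather than the numeric `id` so the identifier is not subject to integer precision issues.

```python
from datetime import datetime

# Layout of the created_at field in the sample tweet,
# e.g. "Mon Mar 25 09:15:00 +0000 2019" (English day/month names assumed).
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def slim_tweet(tweet: dict) -> dict:
    """Return a smaller document with a hypothetical subset of fields."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    created = datetime.strptime(tweet["created_at"], TWITTER_DATE_FORMAT)
    return {
        "id": tweet["id_str"],
        "created_at": created.isoformat(),
        "text": tweet.get("text", ""),
        "lang": tweet.get("lang"),
        "hashtags": [h["text"] for h in (entities.get("hashtags") or [])],
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
        "user": {
            "screen_name": user.get("screen_name"),
            "followers_count": user.get("followers_count", 0),
        },
    }


# Illustrative explicit mapping for the slimmed document (an assumption):
# keyword for exact-match fields, text for full-text search, date for ranges.
TWEET_MAPPING = {
    "properties": {
        "id": {"type": "keyword"},
        "created_at": {"type": "date"},
        "text": {"type": "text"},
        "lang": {"type": "keyword"},
        "hashtags": {"type": "keyword"},
        "retweet_count": {"type": "integer"},
        "favorite_count": {"type": "integer"},
        "user": {
            "properties": {
                "screen_name": {"type": "keyword"},
                "followers_count": {"type": "integer"},
            }
        },
    }
}
```

Which fields to keep depends entirely on the queries you want to run, so treat the list above as a starting point to adjust.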

Try using Logstash if you have the JSON files.

  • input => file => codec => json
  • output => elasticsearch

Docs
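If you would rather script the same file-to-Elasticsearch flow than run Logstash, a rough Python sketch could look like the one below. It only builds the newline-delimited body that the Elasticsearch bulk API expects (an action line followed by the document); it does not send anything. The function names, the `tweets` index name, and the assumption that each file holds either one tweet or a list of tweets are mine, so adapt them to how your files are actually laid out.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_tweets(folder: str) -> Iterator[dict]:
    """Yield tweets from every *.json file in `folder`.

    Assumes each file contains either a single tweet object
    or a JSON array of tweet objects.
    """
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        yield from (data if isinstance(data, list) else [data])


def to_bulk_ndjson(tweets: Iterable[dict], index_name: str = "tweets") -> str:
    """Build a bulk request body: one action line, then one source line."""
    lines = []
    for tweet in tweets:
        # Using id_str as _id means re-running the import overwrites
        # existing documents instead of creating duplicates.
        action = {"index": {"_index": index_name, "_id": tweet["id_str"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(tweet, ensure_ascii=False))
    # The bulk API requires every line, including the last, to end with "\n".
    return "".join(line + "\n" for line in lines)
```

The resulting string can be sent to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. For several years of tweets you would want to send it in batches rather than as one request, and older Elasticsearch versions may also expect a `_type` in the action line.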

My question is: can I use this architecture, given that my data is already in the form of JSON files?
