Java with twitter4j and send info to elasticsearch


#1

Good evening,
I recently started exploring elasticsearch, as I'm doing a job in java to analyze sentiments in tweets.

Initially I was just using java and the tweet4j API. Then I heard about ElasticSearch and started exploring, also using logstash and kibana.

My question is: Is it possible in Java using twitter4j, do tweets search and then send these to the elasticsearch?

I have a Maven project and I followed the dependencies in the elastic search page, more properly

<Dependency>
    <GroupId> org.elasticsearch.client </ groupId>
    <ArtifactId> transport </ artifactId>
    <Version> 5.5.0 </ version>
</ Dependency>

And also the log4j 2 logger, but when executing the following code:

TransportClient client = new PreBuiltTransportClient (Settings.EMPTY)
        .addTransportAddress (new InetSocketTransportAddress (InetAddress.getByName ("host1"), 9300))
        .addTransportAddress (new InetSocketTransportAddress (InetAddress.getByName ("host2"), 9300));

TransportClient is not recognized.

Can someone help me?


(David Pilato) #2

If you want to go straight ahead to the point of injecting Twitter data into elasticsearch, I'd simply use Logstash and its Twitter input.

I wrote something about it here: http://david.pilato.fr/blog/2015/06/01/indexing-twitter-with-logstash-and-elasticsearch/

HTH


#3

Thanks for the answer. I have already read this tutorial(its very nice
cngtz) and can already use the ELK to get data and present the same. The
problem is that the final work will be to integrate into a JAVA application
and what I was proposed is if possible, to use java to search for tweets on
a certain date and with a certain word and going to store the results in
something like elasticsearch. As the elastic page itself has something
about integrating into java, I decided to follow but so far without
success. Do not you know of an article that I can read about it? Thank you


(David Pilato) #4

Ok. So let's go back to your original problem then.

1st of all I'd really recommend using the rest client instead of the transport client.
The transport client will be deprecated soonish.

Then, your maven example is incorrect. Uppercase,/Lowercase, spaces in tags...

TransportClient is not recognized.

That might be the cause...


#5

Thanks for the answer.

In my code I think I have the correct tags and I am no longer getting errors.

However I have some questions, when I use logstash to get twitter data through elasticsearch, I define a name for the index,

output {
	stdout { codec => dots }
	#stdout { codec => rubydebug }
	#stdout { codec => json }
	elasticsearch {
		hosts => "localhost:9200"
		index => "gameot2"
		document_type => "tweet"
		template => "twitter_template.json"
		template_name => "twitter"
  }

does this index correspond to a node in JAVA documentation? And whenever I use a different index, is a new node created in the same Cluster?

I ask this because I'm having trouble noticing what I have to use in the following fields:

1- "host1" and "host2"

TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
		        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))
		        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host2"), 9300));
		#2 - SearchResponse response = client.prepareSearch("gameot2").get();
		#2 -System.out.println(response);
		client.close();

In #2 What I'm trying to do is get the elasticsearch data saved with the index "gameot2".

2- "cluster.name" and "myClusterName".

Settings settings = Settings.builder()
        .put("cluster.name", "myClusterName").build();
TransportClient client = new PreBuiltTransportClient(settings);

I would like to try to understand this part before following your advice and studying the rest client.
Thank you


(David Pilato) #6

No. A node is an elasticsearch instance (a jvm process basically).
An index is where you logically want to store/index your data. An index is split into shards (a shard is a working unit, a Lucene instance behind the scenes).
A shard is located within a node (stored on the hard disk of this node).

  1. your code looks correct

  2. I believe you need to provide the same cluster name that your node is using. Defaults to elasticsearch


#7

No. A node is an elasticsearch instance (a jvm process basically).
An index is where you logically want to store/index your data. An index is split into shards (a shard is a working unit, a Lucene instance behind the scenes).
A shard is located within a node (stored on the hard disk of this node).

Thanks for the info, it helped to realize.

  1. your code looks correct

(new InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))

public static InetAddress getByName(String host)
                             throws UnknownHostException

Determines the IP address of a host, given the host's name.
The host name can either be a machine name, such as "java.sun.com", or a textual representation of its IP address. If a literal IP address is supplied, only the validity of the address format is checked.

In place of "host1" should not put something like the ip of the machine? or?

  1. I believe you need to provide the same cluster name that your node is using. Defaults to elasticsearch

Since I have never changed the cluster name, how can I know the default name of the cluster? Its "elasticsearch"?

Thankyou


(David Pilato) #8

In place of "host1" should not put something like the ip of the machine? or?

Yes. Or the qualified host name.

Since I have never changed the cluster name, how can I know the default name of the cluster? Its "elasticsearch"?

Yes. It's elasticsearch. So in that case, just start with:

TransportClient client = new PreBuiltTransportClient(Settings.EMPTY);

#9

Before continue, I may be doing a wrong job with the elastic folder path, so:

I downloaded elasticsearch-5.4.0 and got the java (maven) project in another folder.
Does the elasticsearch folder have to be placed in a specific directory or just the root of the java project?

Thank you for your time


(David Pilato) #10

Elasticsearch (the server) does not need to be placed in any specific dir. it can either run on another machine or on cloud.elastic.co.


#11

To test the trasportclient is it necessary to configure log4j?

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Following the elastic documentation on log4j, prompts you to create the log4j2.properties file in the path src/main/resources. I do not have the past resources, do I have to create it?

thank you


(David Pilato) #12

Yes.


#13

Thanks for the answers.
I was able to configure the trasnportclient in java and I can access the index stored in the elastic.

At this moment I am moving to the REST client and my questions are:

1 - Having an application in java to fetch data to twitter using twitter4j, with the restclient I'm going to be able to create an index and save the information in the esticsearch and then get the same?

2 - And when I unplug my application, will the cluster and its nodes be saved?

3 -Or will I have to use a database to store the information?

ty


(David Pilato) #14
  1. Yes
  2. Yes
  3. Read https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html#_data_store_recommendations

#15

Thank you.

Response response = restClient.performRequest("GET", "/",
    Collections.singletonMap("pretty", "true"));

Response performRequest(String method, String endpoint,
	                        Header... headers)

1- Here when doing this performRequst what are the "pretty" and "true" strings?
What are Headers and how do I know what to put?

//index a document
HttpEntity entity = new NStringEntity(
        "{\n" +
        "    \"user\" : \"kimchy\",\n" +
        "    \"post_date\" : \"2009-11-15T14:12:12\",\n" +
        "    \"message\" : \"trying out Elasticsearch\"\n" +
        "}", ContentType.APPLICATION_JSON);

2 -
In this case when creating an entity to send, I already have a JSONObject with all the data obtained from twitter (basically an array of type "twitter": [list of tweets]}, I can send that complete object for elasticsearch or I have to for every Tweet like this in the example and send one at a time?

ty


(David Pilato) #16
  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#_pretty_results
  2. I would send one tweet by one tweet.

#17

Thanks.

I can already send and receive information, so I have a doubt in the information handling as JSON.

//create a json to PUT at the ES
		Map<String, String> params2 = Collections.emptyMap();
		String jsonString = "{" +
		            "\"user\":\"Luisinho\"," +
		            "\"postDate\":\"1988\"," +
		            "\"message\":\"esta a começar a andar\"" +
		        "}";
		HttpEntity entity = new NStringEntity(jsonString, ContentType.APPLICATION_JSON);
		Response response4 = restClient.performRequest("PUT", "/posts/doc/2", params, entity);
		
		
		//Search the document using Query Params

		Map<String, String> paramMap = new HashMap<String, String>();
		paramMap.put("q", "user:Luisinho");
		paramMap.put("pretty", "true");
		                               
		Response response5 = restClient.performRequest("GET", "/posts/_search",
		                                                           paramMap);
		
		System.out.println(EntityUtils.toString(response5.getEntity()));
		System.out.println("Host -" + response5.getHost() );

And in response5 I get the following:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "posts",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "user" : "Luisinho",
          "postDate" : "1988",
          "message" : "esta a começar a andar"
        }
      }
    ]
  }
}

Host -http://localhost:9200

1- My goal will be to store and save all the values ​​of the key "message". Is it possible to receive this already in JSON format? how?

2- Is there a way to access these fields from the response5?

3- I'm trying to convert to JSON and look for work from that, but I would like to know if it's already possible to get JSON, so I'm not converting it

String nova=EntityUtils.toString(response5.getEntity());
JSONObject jsonObj = new JSONObject(nova);
String message = jsonObj.getJSONObject("hits").getString("message");
		System.out.println("JSON  "+message);

4-
To populate the elastic search, I also used a logstash file and left it for some time, getting over 5k hits.

But when I do in the browser http: // localhost: 9200 / index / _search it only returns me 9 hits.

Is there a limit? Or is it possible to get all the hits through the browser and performRequest?
thanks


(David Pilato) #18

I don't understand.

My goal will be to store and save all the values ​​of the key “message”.

An example?

I’m trying to convert to JSON

What are you trying to convert to JSON?


#19

My goal is to add tweet to a certain index in elasticsearch. And also have a method able to get all saved tweets of a certain index.

JSON of a tweet exemple:

{
  "took": 356,
  "timed_out": false,
  "_shards": {
"total": 5,
"successful": 5,
"failed": 0
  },
  "hits": {
"total": 45739,
"max_score": 1,
"hits": [
  {
    "_index": "super",
    "_type": "tweet",
    "_id": "AV3CvfikX9cr2kPs41mp",
    "_score": 1,
    "_source": {
      "extended_entities": {
        "media": [
          {
            "display_url": "pic.twitter.com/bBH0mIfqaZ",
            "indices": [
              80,
              103
            ],
            "sizes": {
              "small": {
                "w": 626,
                "h": 350,
                "resize": "fit"
              },
              "large": {
                "w": 626,
                "h": 350,
                "resize": "fit"
              },
              "thumb": {
                "w": 150,
                "h": 150,
                "resize": "crop"
              },
              "medium": {
                "w": 626,
                "h": 350,
                "resize": "fit"
              }
            },
            "id_str": "894963129897111555",
            "expanded_url": "https://twitter.com/EcuadorWillana/status/894963132338249728/photo/1",
            "media_url_https": "https://pbs.twimg.com/media/DGuMOZSW0AMY0s2.jpg",
            "id": 894963129897111600,
            "type": "photo",
            "media_url": "http://pbs.twimg.com/media/DGuMOZSW0AMY0s2.jpg",
            "url": "https://t.co/bBH0mIfqaZ"
          }
        ]
      },
      "in_reply_to_status_id_str": null,
      "in_reply_to_status_id": null,
      "created_at": "Tue Aug 08 16:46:53 +0000 2017",
      "in_reply_to_user_id_str": null,
      "source": "<a href=\"http://ecuadorwillana.com\" rel=\"nofollow\">EcuadorWillana</a>",
      "retweet_count": 0,
      "retweeted": false,
      "geo": null,
      "filter_level": "low",
      "in_reply_to_screen_name": null,
      "is_quote_status": false,
      "id_str": "894963132338249728",
      "in_reply_to_user_id": null,
      "@version": "1",
      "favorite_count": 0,
      "id": 894963132338249700,
      "text": "#RealMadrid y #United disputan hoy #SupercopaDeEuropa - https://t.co/LmsRUby0ww https://t.co/bBH0mIfqaZ",
      "place": null,
      "lang": "es",
      "favorited": false,
      "possibly_sensitive": false,
      "coordinates": null,
      "truncated": false,
      ....
      }
    }
  },

If you use the above code I will get all the fields of a tweet and what I want is to access a specific "key" which is the "text" getting its value.

1- How can I get a field specific to something like Response?

2- I am considering that it is possible to use the query match_all, I can get all the tweets of a given index and get the value of the key "text" of each one.

I spoke in JSON because I want to create a JSON object with this structure

EntityUtils.toString(response5.getEntity())

so that it is easier to access the specific key.

Can I explain myself better?
I'm very sorry, my English is bad and I'm new to this "world" too.


(David Pilato) #20

My goal is to add tweet to a certain index in elasticsearch. And also have a method able to get all saved tweets of a certain index.

When you say "certain index", do you mean that tweets will go to different indices?

But anyway, I think I understood with your example.

If you want to extract only one field from the _source, you can use stored_fields https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-stored-fields.html.

Otherwise, you can read the full JSon response as a Map (see for example https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/client/JsonUtil.java#L60), then read inside this map to find the hits.hits[0]._source.text field. Something similar to https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/client/ElasticsearchClient.java#L337-L350.

  1. If you want to extract all the existing documents, you can't use classic _search for that. But Scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-scroll.html
    But first, try to solve your first problem.

Not a problem. I'm not english native either so I probably need more examples to understand.

and I’m new to this “world” too

Welcome to the real world, Neo. :stuck_out_tongue: