Stream2es, Twitter River and changes to the mapping

Finn_Poitier · December 12, 2013, 8:05pm

Hello,

I am new to Elasticsearch and trying to get Twitter documents indexed. I
tried a few things, which raised some questions. Let me explain:

First I used Stream2es, which pulled Twitter documents in quickly, but I
didn't see any way to filter the tweets as can be done with the Twitter
River.

Then I used the Twitter River; its filtering works as described. But when
checking the tweets' content, I saw that fewer fields are indexed than
with Stream2es, for example user data such as followers_count
or friends_count.

So I wondered:

1.) Is there a way to do filtering with Stream2es?
2.) Is there a way to change the mapping of the Twitter River? I tried to
change the Twitter River mapping the following way, without success (it
did not start indexing anything):

Creating a new index:

curl -XPUT 'http://localhost:9200/twitter_socialmedia_river/' -d '
index :
    type : mynewmapping
    bulk_size : 10
'

Applying a desired mapping to the index:

curl -XPUT 'http://localhost:9200/twitter_socialmedia_river/mynewmapping/_mapping' -d '
{
  "mynewmapping" : {
    "properties" : {
      "user": {
        "properties": {
          "location": {"type": "string"},
          "default_profile": {"type": "boolean"},
          "statuses_count": {"type": "long"},
          "lang": {"type": "string"},
          "id": {"type": "long"},
          "favourites_count": {"type": "long"},
          and some other properties hereafter...
        }
      }
    }
  }
}
'

Starting the Twitter River:

curl -XPUT localhost:9200/_river/twitter_socialmedia_river/_meta -d '
{
  "type" : "mynewmapping",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "myValues",
      "consumer_secret" : "myValues",
      "access_token" : "myValues",
      "access_token_secret" : "myValues"
    },
    "filter" : {
      "tracks" : ["socialmedia", "social media"]
    }
  }
}
'

Any help is much appreciated, thanks in advance!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0f7579ff-cd63-403c-a731-36902eb8c99d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · December 12, 2013, 8:37pm

I think you are looking for this: https://github.com/elasticsearch/elasticsearch-river-twitter#indexing-raw-twitter-stream

HTH

David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Finn_Poitier · December 12, 2013, 10:36pm

Thanks, David!

Ah, that works, "somehow":

After creating the index and adding "mynewmapping", I can't enter the type
"mynewmapping" as the top-level "type" in the river definition below. It
seems this value always has to be "twitter" to keep the river functional?

curl -XPUT localhost:9200/_river/zweiter_twitter_socialmedia_river/_meta -d '
{
  "type" : "twitter",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "MyValue",
      "consumer_secret" : "MyValue",
      "access_token" : "MyValue",
      "access_token_secret" : "MyValue"
    },
    "raw" : true
  }
}
'

So when I run the river with "type" : "twitter", it adds the mapping
"status" to the index and somehow mixes in the values of "mymapping".

As a result I have all the "status" and "mymapping" values.

Now my question is: would it be possible to keep just the "mymapping"
values, without the "status" values, so that the document stays small and
contains only the desired values?

Thanks again for your help, I hope my description is not too confusing.

dadoonet · December 13, 2013, 12:04am

You should create it like this:

curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
  "type" : "twitter",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "*** YOUR Consumer key HERE ***",
      "consumer_secret" : "*** YOUR Consumer secret HERE ***",
      "access_token" : "*** YOUR Access token HERE ***",
      "access_token_secret" : "*** YOUR Access token secret HERE ***"
    },
    "raw" : true
  },
  "index" : {
    "index" : "my_twitter_river",
    "type" : "mymapping",
    "bulk_size" : 100
  }
}
'

In the index part, you set the destination index and type. The top-level
"type" stays "twitter", because it selects the river itself.
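
The split between the river's own type and the destination index/type can be sketched in Python. This is only an illustration of the JSON body shown above: the helper name build_river_meta and the placeholder credentials are hypothetical, and the sketch builds the body without contacting a cluster.

```python
import json


def build_river_meta(dest_index, dest_type, oauth, raw=True, bulk_size=100):
    """Build the body for PUT /_river/<name>/_meta (hypothetical helper)."""
    return {
        # The top-level type selects the river plugin, so it stays "twitter".
        "type": "twitter",
        "twitter": {"oauth": oauth, "raw": raw},
        # The destination index and mapping type belong in the "index" part.
        "index": {
            "index": dest_index,
            "type": dest_type,
            "bulk_size": bulk_size,
        },
    }


# Placeholder credentials, not real values.
oauth = {
    "consumer_key": "YOUR Consumer key HERE",
    "consumer_secret": "YOUR Consumer secret HERE",
    "access_token": "YOUR Access token HERE",
    "access_token_secret": "YOUR Access token secret HERE",
}

meta = build_river_meta("my_twitter_river", "mymapping", oauth)
print(json.dumps(meta, indent=2))
```

The printed JSON is what would go after -d in the curl command above.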

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Finn_Poitier · December 13, 2013, 2:31am

Thanks, David!

That works now insofar as the river actually uses mymapping and
doesn't create another mapping called status.

But when the river starts, it automatically updates mymapping and adds
all available properties from Twitter again. I guess that may even be
intentional?

As a result, the average document size is about 3 KB, while my goal was to
reduce it to just the data I need.

Would it perhaps make sense to parse the tweet data into Redis and then use
a Redis river to get it into ES?

Chances are I'm still doing something wrong, as I have only just started
with ES.

dadoonet · December 13, 2013, 6:00am

Ha! This is because raw:true adds more fields to the JSON doc.
Remove this option and see where it goes.

If you want to remove/add some fields, you can either:

- do that in another process, which means not using the Twitter river, or
- use excludes in the mapping. See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html#include-exclude
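
The first option, trimming tweets in a separate process before indexing, might look roughly like the sketch below. The field list, the sample tweet and the helper names are assumptions for illustration only; fetching the tweets and sending the result to Elasticsearch are left out.

```python
_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def get_path(doc, path):
    """Look up a dotted path such as 'user.followers_count' in nested dicts."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return _MISSING
        cur = cur[key]
    return cur


def set_path(doc, path, value):
    """Store value under a dotted path, creating nested dicts as needed."""
    keys = path.split(".")
    for key in keys[:-1]:
        doc = doc.setdefault(key, {})
    doc[keys[-1]] = value


def project(tweet, paths):
    """Return a new dict holding only the listed paths that exist in tweet."""
    slim = {}
    for path in paths:
        value = get_path(tweet, path)
        if value is not _MISSING:
            set_path(slim, path, value)
    return slim


# Hypothetical whitelist and sample tweet.
WANTED = ["text", "retweet_count", "user.id",
          "user.followers_count", "user.friends_count"]

tweet = {
    "id": 1,
    "text": "social media rocks",
    "retweet_count": 3,
    "user": {"id": 42, "lang": "en", "followers_count": 10,
             "friends_count": 5, "default_profile": True},
    "entities": {"urls": []},
}

slim = project(tweet, WANTED)
print(slim)
```

Only the slim document would then be indexed, so the mapping never sees the fields that were dropped.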

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Finn_Poitier · December 13, 2013, 5:22pm

On Friday, December 13, 2013 7:00:08 AM UTC+1, David Pilato wrote:

> Ha! This is because raw:true adds more fields to the json doc.
> Remove this option and see where it goes.

Yes, I had already guessed that and tried it last night. The result was a
bit surprising, but I bet that partly traces back to my lack of
experience with Elasticsearch.

What exactly happened: starting the river without raw:true still
triggered a mapping update at the beginning, and only part of the full set
of Twitter properties got mapped. Some properties were still added to my
mapping, while some of my own mapping properties weren't mapped.

> If you want to remove/add some fields, you can:
>
> use excludes in mapping. See:
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html#include-exclude

That's a very good hint, thanks! I will try that next.

Finn_Poitier · January 5, 2014, 8:06am

David,

A few findings and a question:

1.) When I use the Twitter River with "raw": true, the property
"retweet_count" is not populated and stays at 0, even with over 100k
tweets indexed. Without "raw": true, "retweet_count" works perfectly
well.
2.) "_source" : { "excludes" : ["retweeted_status.*"]}, for example, works:
retweeted_status.* (in this case) is no longer indexed, even though it is
still inserted into the automatically updated mapping.
3.) My problem is still that, even when excluding almost everything, the
amount of data used to index 1000 tweets is very high (about 5 MB),
compared to only about 1.5 MB when not using "raw": true (your mapping).

So my question is whether I could somehow not use "raw": true but still
add a few extra properties that I am missing in your mapping (such as
user.followers_count)?

Greetings,
Finn
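
One way to avoid listing almost every field under excludes might be to whitelist the wanted fields with includes, the counterpart described in the same _source documentation section linked earlier in this thread. The sketch below only builds such a mapping body. The field names are examples taken from this thread and the helper name is hypothetical. Whether this noticeably reduces the index size is untested here: as far as I understand, the setting affects what is stored in _source rather than which fields get indexed.

```python
import json


def source_filter_mapping(type_name, includes=None, excludes=None):
    """Build a mapping body with a _source filter (hypothetical helper)."""
    source = {}
    if includes:
        source["includes"] = list(includes)
    if excludes:
        source["excludes"] = list(excludes)
    return {type_name: {"_source": source}}


# Example whitelist; adjust to the fields actually needed.
mapping = source_filter_mapping(
    "mymapping",
    includes=[
        "text",
        "retweet_count",
        "user.id",
        "user.followers_count",
        "user.friends_count",
    ],
)
print(json.dumps(mapping, indent=2))
```

The printed JSON would be the body of a PUT to the type's _mapping endpoint, applied before the river is started.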
