Stream2es, Twitter River and changes to the mapping


(Finn Poitier) #1

Hello,

Being new to Elasticsearch and trying to get Twitter documents indexed, I
tried a few things, which in turn raised some questions. Let me explain:

First I used Stream2es, which pulled Twitter documents in quickly, but I
didn't see any way to filter the tweets as can be done with the Twitter
River.

Then I used the Twitter River; its filtering works as described. But when
checking the content of the tweets, I saw that fewer fields are indexed
than with Stream2es, for example user data such as followers_count
or friends_count.

So I wondered:

1.) Is there a way to do filtering with Stream2es?
2.) Is there a way to change the mapping of the Twitter River? I tried to
change the Twitter River mapping the following way, but without success (it
didn't start to index anything):

Creating a new index:

curl -XPUT 'http://localhost:9200/twitter_socialmedia_river/' -d '
index :
    type : mynewmapping
    bulk_size : 10
'

Applying a desired mapping to the index:

curl -XPUT 'http://localhost:9200/twitter_socialmedia_river/mynewmapping/_mapping' -d '
{
    "mynewmapping" : {
        "properties" : {
            "user": {
                "properties": {
                    "location": {"type": "string"},
                    "default_profile": {"type": "boolean"},
                    "statuses_count": {"type": "long"},
                    "lang": {"type": "string"},
                    "id": {"type": "long"},
                    "favourites_count": {"type": "long"},
                    and some other properties hereafter...
                }
            }
        }
    }
}
'

Starting the Twitter River:

curl -XPUT localhost:9200/_river/twitter_socialmedia_river/_meta -d '
{
    "type" : "mynewmapping",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "myValues",
            "consumer_secret" : "myValues",
            "access_token" : "myValues",
            "access_token_secret" : "myValues"
        },
        "filter" : {
            "tracks" : ["socialmedia", "social media"]
        }
    }
}
'

Any help is much appreciated, thanks in advance!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0f7579ff-cd63-403c-a731-36902eb8c99d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

I think you are looking for this: https://github.com/elasticsearch/elasticsearch-river-twitter#indexing-raw-twitter-stream

HTH

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 12 Dec 2013, at 21:05, Finn Poitier finnpoitier@googlemail.com wrote:



(Finn Poitier) #3

Thanks, David!

Ah, that works "somehow":

After creating the index and adding "mynewmapping", I can't enter the type
"mynewmapping" as the "type" of the river. It seems this value always has to
be "twitter" to keep the river functional?

curl -XPUT localhost:9200/_river/zweiter_twitter_socialmedia_river/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "MyValue",
            "consumer_secret" : "MyValue",
            "access_token" : "MyValue",
            "access_token_secret" : "MyValue"
        },
        "raw" : true
    }
}
'

So when running the river with "type" : "twitter", it adds the mapping
"status" to the index AND somehow mixes the values of "mymapping" in.

As a result I have all the "status" and "mymapping" values.

Now my question is: would it be possible to keep just the "mymapping"
values, without the "status" values, so that the documents stay small and
contain only the desired values?

Thanks again for your help, I hope my description is not too confusing :wink:

On Thursday, December 12, 2013 9:37:38 PM UTC+1, David Pilato wrote:



(David Pilato) #4

You should create it like this:

curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "*** YOUR Consumer key HERE ***",
            "consumer_secret" : "*** YOUR Consumer secret HERE ***",
            "access_token" : "*** YOUR Access token HERE ***",
            "access_token_secret" : "*** YOUR Access token secret HERE ***"
        },
        "raw" : true
    },
    "index" : {
        "index" : "my_twitter_river",
        "type" : "mymapping",
        "bulk_size" : 100
    }
}
'

In the "index" part, you set the destination index and type.
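
To check where the documents end up, one could fetch the mappings of the destination index once the river has been running for a while. A minimal sketch, assuming a node on localhost:9200 and the index name from the river definition above; the ES_URL variable and the helper names mapping_url and show_mapping are made up for illustration:

```shell
# Assumed base URL of the local Elasticsearch node.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the mapping URL for a given index (hypothetical helper).
mapping_url() {
    printf '%s/%s/_mapping' "$ES_URL" "$1"
}

# Print the mappings of the index. With the "index" section set,
# the configured type should show up here rather than "status".
show_mapping() {
    curl -XGET "$(mapping_url "$1")?pretty"
}
```

Calling show_mapping my_twitter_river then lists the types that exist in that index.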

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

On 12 December 2013 at 23:36:58, Finn Poitier (finnpoitier@googlemail.com) wrote:



(Finn Poitier) #5

Thanks, David!

That works now, insofar as the river actually uses mymapping and
doesn't create another one called status.

But when the river starts, it automatically updates mymapping and adds
all available properties from Twitter again. I guess that may even be
intentional?

As a result, the average document size is about 3 KB, while my goal was to
reduce it to just the data I need.
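
For reference, an average like this can be derived from the index stats, which report a document count (docs.count) and a store size in bytes (store.size_in_bytes). A rough sketch, assuming a node on localhost:9200; the helper names and the sample numbers are made up for illustration:

```shell
# Assumed base URL of the local Elasticsearch node.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetch the stats of one index; look for docs.count and
# store.size_in_bytes in the response (hypothetical helper).
index_stats() {
    curl -XGET "$ES_URL/$1/_stats?pretty"
}

# Integer average: store size in bytes divided by document count.
avg_doc_bytes() {
    echo $(( $1 / $2 ))
}
```

With, say, 3072000 bytes for 1000 documents, avg_doc_bytes 3072000 1000 prints 3072, i.e. about 3 KB per document. The store size covers the whole index on disk, not only the stored JSON, so this is an upper bound on the document size itself.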

Would it perhaps make sense to parse the tweet data into Redis and then use a
Redis river to get it into ES?

Chances are I'm still doing something wrong, too, as I have only just started
with ES.



(David Pilato) #6

Ha! This is because raw:true adds more fields to the json doc.
Remove this option and see where it goes.
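
Dropping the option would look roughly like this: the same river definition as in the earlier example, just without "raw". A sketch only; the river and index names are carried over from that example, the OAuth values are placeholders, and the shell variable and function names are made up here:

```shell
# Assumed base URL of the local Elasticsearch node.
ES_URL="${ES_URL:-http://localhost:9200}"

# River definition without the raw option; the "index" section still
# points the river at the destination index and type.
RIVER_BODY='{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "*** YOUR Consumer key HERE ***",
            "consumer_secret" : "*** YOUR Consumer secret HERE ***",
            "access_token" : "*** YOUR Access token HERE ***",
            "access_token_secret" : "*** YOUR Access token secret HERE ***"
        }
    },
    "index" : {
        "index" : "my_twitter_river",
        "type" : "mymapping",
        "bulk_size" : 100
    }
}'

# Register the river (hypothetical helper; nothing is sent until called).
create_river() {
    curl -XPUT "$ES_URL/_river/my_twitter_river/_meta" -d "$RIVER_BODY"
}
```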

If you want to remove/add some fields, you can:

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 13 Dec 2013, at 03:31, Finn Poitier finnpoitier@googlemail.com wrote:



(Finn Poitier) #7

On Friday, December 13, 2013 7:00:08 AM UTC+1, David Pilato wrote:

> Ha! This is because raw:true adds more fields to the json doc.
> Remove this option and see where it goes.

Yes, I had already guessed that and tried it last night. The result was a
bit surprising, but I bet that partly traces back to my lack of
experience with Elasticsearch.

What exactly happened: starting the river without raw:true still
triggered a mapping update at the beginning, and only part of the full set
of Twitter properties got mapped. Some properties were still added to my
mapping, while some of the properties in my mapping weren't mapped.

> If you want to remove/add some fields, you can:

That's a very good hint, thanks! I will try that next.



(Finn Poitier) #8

David,

A few findings and a question:

  • When I use the Twitter River with "raw": true, the property
    "retweet_count" is not indexing values and stays at 0, even with over
    100k tweets indexed. Without "raw": true, "retweet_count" works
    perfectly well.
  • "_source" : { "excludes" : ["retweeted_status.*"]}, for example, works:
    retweeted_status.* (in this case) is no longer indexed, even though it is
    still inserted into the automatically updated mapping.
  • My problem is still that, even when excluding almost everything,
    the amount of data used to index 1000 tweets is very high
    (about 5 MB), compared to only about 1.5 MB when not using "raw": true
    (your mapping).
  • So my question is whether it might be possible to somehow not use
    "raw": true but add a few extra properties I am missing in your mapping
    (such as user.followers_count)?
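
For completeness, the "_source" exclusion from the second point could be put into the type mapping before the river is started. A sketch only, assuming the index my_twitter_river already exists and reusing the type name mymapping from earlier in the thread; the variable and function names are made up, and the excluded path is just the one example mentioned above:

```shell
# Assumed base URL of the local Elasticsearch node.
ES_URL="${ES_URL:-http://localhost:9200}"

# Mapping fragment: paths listed under "excludes" are dropped from the
# stored _source of each document.
MAPPING_BODY='{
    "mymapping" : {
        "_source" : {
            "excludes" : ["retweeted_status.*"]
        }
    }
}'

# Apply the mapping to the destination type (hypothetical helper;
# nothing is sent until the function is called).
put_mapping() {
    curl -XPUT "$ES_URL/my_twitter_river/mymapping/_mapping" -d "$MAPPING_BODY"
}
```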

Greetings,
Finn


