I am using Elasticsearch to index the Twitter stream. Until recently I was
using the official river, which was working great, but I realized that it
throws away much of the data (e.g. it does not store the number of
followers, etc.).
Is there a way to make the river store all the data? If not, I am fine
with writing streaming code that will stream and index myself. But I have a
concern: how many documents per second can Elasticsearch index? I might
eventually need to index almost 10,000 documents (each about 2 KB) per
second (the current requirement is 100 documents per second). Is this even
feasible? If so, do I need to make any special modifications?
You should look at the raw option, or better, look at Logstash.
My 2 cents.
David
Thanks. I'll have a look at the raw option.
Regarding Logstash, I don't fully understand its utility. It says that it
can take messages from a Redis server, but if I have to set up Redis, I
could simply use the Redis river to index into Elasticsearch. Is there any
additional benefit that Logstash would give me?
Logstash has a Twitter input, so you can extract content from Twitter and send it to Elasticsearch. No need for Redis here.
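For example, a minimal Logstash configuration along these lines (the credentials are placeholders; `full_tweet => true` asks the input to keep the complete tweet payload, including fields like the follower count, rather than a trimmed version — check the twitter input docs for the exact option names in your Logstash version):

```
input {
  twitter {
    consumer_key       => "YOUR_CONSUMER_KEY"
    consumer_secret    => "YOUR_CONSUMER_SECRET"
    oauth_token        => "YOUR_ACCESS_TOKEN"
    oauth_token_secret => "YOUR_ACCESS_TOKEN_SECRET"
    keywords           => ["elasticsearch"]
    full_tweet         => true
  }
}
output {
  elasticsearch {
    host  => "localhost"
    index => "tweets"
  }
}
```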
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
No, that was the whole point: will Elasticsearch be able to index, say,
10,000 documents per second? If yes, I can simply hook my Twitter code
up to ES. If not, I need to think about how to make that happen.
Typically I've seen ES index only around 30 docs per second, which is
pretty low.
I am hoping Redis/Kafka/Logstash/etc. might give Elasticsearch some
breathing room and enable it to index up to 10K docs per second.
I can index 10,000-12,000 docs per second on my laptop. With SSD drives, of course.
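For context: rates like that are reachable only with the `_bulk` API, which packs many documents into one newline-delimited request body, so you pay the HTTP and coordination overhead once per batch instead of once per tweet. A minimal sketch of building such a body in Python (the index name and fields are made up for illustration):

```python
import json

def build_bulk_body(tweets, index="tweets", doc_type="tweet"):
    """Build a newline-delimited _bulk request body:
    one action line followed by one source line per document."""
    lines = []
    for tweet in tweets:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(tweet))
    # The body must end with a trailing newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    {"text": "hello world", "followers_count": 42},
    {"text": "second tweet", "followers_count": 7},
])
# POST this body to http://localhost:9200/_bulk
```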
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
Awesome! Great to know. So, to conclude, the steps will be:
1. Stream tweets from Twitter.
2. Use the bulk API to build batches of 1,000 (or more) tweets.
3. Once the batch size is reached, spawn a new thread to index the batch
into ES, while the original thread continues streaming tweets.
Do these steps sound alright to you, or did I miss something?
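The steps above can be sketched roughly like this in Python: a queue decouples the streaming thread from a single long-lived indexing thread (cheaper than spawning a thread per batch), and `index_batch` is a stub standing in for the real `_bulk` call:

```python
import queue
import threading

BATCH_SIZE = 1000
batch_queue = queue.Queue()
indexed_counts = []  # filled by the stub; a real version would send _bulk requests

def index_batch(batch):
    # Stand-in for a real _bulk request to Elasticsearch.
    indexed_counts.append(len(batch))

def indexer():
    # Background thread: pull completed batches off the queue and index them.
    while True:
        batch = batch_queue.get()
        if batch is None:  # sentinel: the stream has ended
            break
        index_batch(batch)

def stream(tweets):
    # Main thread: keep consuming the stream, hand off full batches.
    worker = threading.Thread(target=indexer)
    worker.start()
    batch = []
    for tweet in tweets:
        batch.append(tweet)
        if len(batch) >= BATCH_SIZE:
            batch_queue.put(batch)
            batch = []
    if batch:
        batch_queue.put(batch)  # flush the final partial batch
    batch_queue.put(None)
    worker.join()
```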
Sounds good.
If you are using Java, you could also look at the river code.
Note that you should use the BulkProcessor class, which is super handy.
BTW, I said 10,000/s, but not for tweets: I had fewer fields (20) than Twitter has (>100).
With more fields, I guess it would take more time, though with better machines it could work. I'd say you need to test on your production cluster.
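BulkProcessor flushes a buffered batch whenever either a document-count or a byte-size threshold is crossed (its setBulkActions/setBulkSize settings). The semantics look roughly like this — a self-contained Python sketch of the idea, not the Java API:

```python
import json

class BulkBuffer:
    """Flush when either max_actions docs or max_bytes of payload accumulate,
    mimicking BulkProcessor's count and size thresholds."""

    def __init__(self, flush, max_actions=1000, max_bytes=5 * 1024 * 1024):
        self.flush_fn = flush
        self.max_actions = max_actions
        self.max_bytes = max_bytes
        self.docs, self.nbytes = [], 0

    def add(self, doc):
        self.docs.append(doc)
        self.nbytes += len(json.dumps(doc))
        if len(self.docs) >= self.max_actions or self.nbytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.docs:
            self.flush_fn(self.docs)
            self.docs, self.nbytes = [], 0

# Usage: collect flushed batches; flush() at the end drains the partial batch.
flushes = []
buf = BulkBuffer(flushes.append, max_actions=3)
for i in range(7):
    buf.add({"id": i})
buf.flush()
```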
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs