Questions about scaling elasticsearch with regard to the number of documents indexed per second

Hi,

I am using Elasticsearch to index the Twitter stream. Until recently I was
using the official river, which was working great, but I realized that it is
throwing out much of the data (e.g. it does not store the number of
followers).

Is there a way to make the river store all the data? If not, I am fine
with writing my own streaming code to stream and index the tweets. But I have
a concern: how many documents can Elasticsearch index per second? I might
eventually need to index almost 10,000 documents (each document = 2 KB) per
second (the current requirement is 100 documents per second). Is this even
feasible? If yes, do I need to make any special modifications?

Thanks in advance!
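
For a sense of scale, the rates in the question can be turned into raw data volume. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1000 bytes), ignores replicas and index overhead, and the function name is made up for illustration.

```python
# Back-of-the-envelope sizing for the indexing rates mentioned above.
DOC_SIZE_BYTES = 2 * 1000        # 2 KB per document (decimal KB is an assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_rate(docs_per_second, doc_size_bytes=DOC_SIZE_BYTES):
    """Return (megabytes per second, gigabytes per day) of raw JSON."""
    bytes_per_second = docs_per_second * doc_size_bytes
    mb_per_second = bytes_per_second / 1e6
    gb_per_day = bytes_per_second * SECONDS_PER_DAY / 1e9
    return mb_per_second, gb_per_day


for rate in (100, 10_000):
    mb_s, gb_day = ingest_rate(rate)
    print(f"{rate:>6} docs/s -> {mb_s:.1f} MB/s, {gb_day:.0f} GB/day")
```

Under those assumptions the 10,000 docs/s target is about 20 MB/s of raw JSON, or roughly 1.7 TB per day, so disk capacity and retention probably matter as much as the raw indexing rate.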

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/da547692-903b-4793-a77e-fd5f0b5a01b7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You should look at the raw option or, better, look at Logstash.
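
For reference, the raw option is a flag in the river's settings document. The sketch below is written from memory of the Twitter river README, so treat the field names (`raw` and the `oauth` keys), the river name `my_twitter_river`, and the placeholder credentials as assumptions to check against the plugin version you run.

```python
import json

# Hypothetical river settings; every value here is a placeholder.
river_settings = {
    "type": "twitter",
    "twitter": {
        "oauth": {
            "consumer_key": "<consumer key>",
            "consumer_secret": "<consumer secret>",
            "access_token": "<access token>",
            "access_token_secret": "<access token secret>",
        },
        # Assumption: with raw enabled, the river indexes the tweet JSON
        # as received instead of its own reduced representation.
        "raw": True,
    },
}

# The settings would be sent as the body of a request like this one.
body = json.dumps(river_settings, indent=2)
print("PUT /_river/my_twitter_river/_meta")
print(body)
```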

My 2 cents.

David

Thanks. I'll have a look at the raw option.
Regarding Logstash, I don't fully understand its utility. It says that it
can take messages from a Redis server. But if I have to set up Redis, I
could simply use the Redis river to index into Elasticsearch. Is there any
additional benefit that Logstash would give me?

Logstash has a Twitter input, so you can extract content from Twitter and send it to Elasticsearch. No need to have Redis here.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

No, the whole point was: will Elasticsearch be able to index, say,
10,000 documents per second? If yes, I can simply hook up my Twitter code
to ES. If not, I would need to think about how to make that happen.
Typically I've seen ES index only around 30 docs per second, which is
pretty low.

I am hoping Redis, Kafka, Logstash, etc. might give Elasticsearch
some breathing room and enable it to index up to 10K docs per second.

On my laptop I can index 10,000-12,000 docs per second. With SSD drives, of course.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Awesome! Great to know that. So, in conclusion, the steps will be:

  1. Stream tweets from Twitter
  2. Use the bulk API to make batches of 1000 (or more) tweets
  3. Once the batch size is reached, spawn a new thread which will index the
    data into ES while my original thread continues streaming tweets

Do these steps sound alright to you or did I miss something?
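
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a tested client: `send_bulk` is a hypothetical callable that would POST the body to the `_bulk` endpoint, the index and type names are made up, and the `_type` field reflects the Elasticsearch 1.x versions current for this thread. It also swaps "spawn a new thread per batch" for one long-lived worker fed by a bounded queue, so that a slow cluster slows the reader down instead of piling up threads.

```python
import json
import queue
import threading

BATCH_SIZE = 1000  # batch size from step 2


def to_bulk_body(tweets, index="tweets", doc_type="status"):
    """Serialize tweets into the newline-delimited body the bulk API expects:
    one action line followed by one source line per document."""
    lines = []
    for tweet in tweets:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(tweet))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def run_pipeline(tweet_stream, send_bulk, batch_size=BATCH_SIZE):
    """Read tweets on the calling thread and index batches on a worker thread.

    Returns the list of exceptions raised by send_bulk (empty on success).
    """
    batches = queue.Queue(maxsize=4)  # bounded: applies back-pressure to the reader
    errors = []

    def worker():
        while True:
            batch = batches.get()
            if batch is None:  # sentinel: no more batches
                break
            try:
                send_bulk(to_bulk_body(batch))
            except Exception as exc:  # keep draining so the reader never blocks forever
                errors.append(exc)

    indexer = threading.Thread(target=worker)
    indexer.start()

    batch = []
    for tweet in tweet_stream:
        batch.append(tweet)
        if len(batch) >= batch_size:
            batches.put(batch)
            batch = []
    if batch:  # flush the final partial batch
        batches.put(batch)
    batches.put(None)
    indexer.join()
    return errors
```

With a stand-in such as `sent = []` and `run_pipeline(stream, sent.append)`, each element of `sent` is one bulk request body. A real `send_bulk` would also need to inspect the bulk response, since individual items can fail even when the request as a whole succeeds.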

Sounds good.
If you are using Java, you could also look at the river code.
Note that you should use the BulkProcessor class, which is super handy.
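
To give an idea of what such a helper does for you, here is a simplified Python analog of the buffering idea. It is not the Java BulkProcessor API: the class name, method names, and thresholds are made up for illustration, and `flush` is a hypothetical callable that would send the buffered documents in one bulk request.

```python
import json


class SimpleBulkBuffer:
    """Collect documents and flush when a count or byte threshold is reached."""

    def __init__(self, flush, max_actions=1000, max_bytes=5 * 1024 * 1024):
        self._flush = flush              # called with a list of JSON strings
        self._max_actions = max_actions  # illustrative threshold
        self._max_bytes = max_bytes      # illustrative threshold
        self._docs = []
        self._size = 0

    def add(self, doc):
        encoded = json.dumps(doc)
        self._docs.append(encoded)
        self._size += len(encoded.encode("utf-8"))
        if len(self._docs) >= self._max_actions or self._size >= self._max_bytes:
            self.flush()

    def flush(self):
        """Send whatever is buffered, if anything, and reset the buffer."""
        if self._docs:
            docs, self._docs, self._size = self._docs, [], 0
            self._flush(docs)

    def close(self):
        """Flush the remaining documents; call once the stream ends."""
        self.flush()
```

As far as I know, the real class can additionally flush on a time interval and run several bulk requests concurrently, which this sketch leaves out.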

BTW, I said 10,000/s, but not for tweets. I have fewer fields (20) than Twitter (>100).
With more fields, I guess it would take more time, though with better machines it could work. I'd say that you need to test on the production cluster.
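
A test like that can be as simple as timing batches of representative documents. A minimal sketch, where `index_batch` is a hypothetical callable standing in for whatever sends one bulk request to the cluster:

```python
import time


def measure_throughput(index_batch, docs, batch_size=1000):
    """Index docs in batches and return the observed documents per second."""
    start = time.perf_counter()
    total = 0
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        index_batch(batch)  # blocks until the batch has been sent
        total += len(batch)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


# Stand-in that only counts documents; replace it with a real bulk call.
indexed = []
docs = [{"id": i, "text": "sample tweet"} for i in range(5000)]
rate = measure_throughput(indexed.extend, docs)
print(f"{rate:.0f} docs/s (stand-in indexer, not a real measurement)")
```

The number only means something when the documents look like real tweets (same fields, same mapping) and the run is long enough to include refreshes and segment merges.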

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
