I am using Elasticsearch to index the Twitter stream. Until recently I was
using the official river, which was working great, but I realized that it
throws away much of the data (e.g. it does not store the number of
followers, etc.).
Is there a way to make the river store all the data? If not, I am fine
with writing streaming code that will stream and index myself. But I have a
concern: how many documents per second can Elasticsearch index? I might
eventually need to index almost 10,000 documents (each about 2 KB) per
second (the current requirement is 100 documents per second). Is this even
feasible? If so, do I need to make any special modifications?
You should look at the raw option, or better, look at Logstash.
My 2 cents.
David
Thanks. I'll have a look at the raw option.
Regarding Logstash, I don't fully understand its utility. It says that it
can take messages from a Redis server, but if I have to set up Redis, I
could simply use the Redis river to index into Elasticsearch. Is there any
additional benefit that Logstash would give me?
Logstash has a Twitter input, so you can extract content from Twitter and send it to Elasticsearch. No need for Redis here.
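For example, a minimal Logstash configuration along these lines (the credentials are placeholders; `full_tweet => true` asks the input to keep the complete tweet payload, including fields like the follower count, rather than a trimmed version — check the twitter input docs for the exact option names in your Logstash version):

```
input {
  twitter {
    consumer_key       => "YOUR_CONSUMER_KEY"
    consumer_secret    => "YOUR_CONSUMER_SECRET"
    oauth_token        => "YOUR_ACCESS_TOKEN"
    oauth_token_secret => "YOUR_ACCESS_TOKEN_SECRET"
    keywords           => ["elasticsearch"]
    full_tweet         => true
  }
}
output {
  elasticsearch {
    host  => "localhost"
    index => "tweets"
  }
}
```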
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
No, that was the whole point: will Elasticsearch be able to index, say,
10,000 documents per second? If yes, I can simply hook my Twitter code
up to ES. If not, I need to think about how to make that happen.
Typically I've seen ES index only around 30 docs per second, which is
pretty low.
I am hoping Redis/Kafka/Logstash/etc. might give Elasticsearch some
breathing room and enable it to index up to 10K docs per second.
I can index 10,000-12,000 docs per second on my laptop. With SSD drives, of course.
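For context: rates like that are reachable only with the `_bulk` API, which packs many documents into one newline-delimited request body, so you pay the HTTP and coordination overhead once per batch instead of once per tweet. A minimal sketch of building such a body in Python (the index name and fields are made up for illustration):

```python
import json

def build_bulk_body(tweets, index="tweets", doc_type="tweet"):
    """Build a newline-delimited _bulk request body:
    one action line followed by one source line per document."""
    lines = []
    for tweet in tweets:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(tweet))
    # The body must end with a trailing newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    {"text": "hello world", "followers_count": 42},
    {"text": "second tweet", "followers_count": 7},
])
# POST this body to http://localhost:9200/_bulk
```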
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
Awesome! Great to know. So, to conclude, the steps will be:
1. Stream tweets from Twitter.
2. Use the bulk API to build batches of 1,000 (or more) tweets.
3. Once the batch size is reached, spawn a new thread to index the batch
into ES, while the original thread continues streaming tweets.
Do these steps sound alright to you, or did I miss something?
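The steps above can be sketched roughly like this in Python: a queue decouples the streaming thread from a single long-lived indexing thread (cheaper than spawning a thread per batch), and `index_batch` is a stub standing in for the real `_bulk` call:

```python
import queue
import threading

BATCH_SIZE = 1000
batch_queue = queue.Queue()
indexed_counts = []  # filled by the stub; a real version would send _bulk requests

def index_batch(batch):
    # Stand-in for a real _bulk request to Elasticsearch.
    indexed_counts.append(len(batch))

def indexer():
    # Background thread: pull completed batches off the queue and index them.
    while True:
        batch = batch_queue.get()
        if batch is None:  # sentinel: the stream has ended
            break
        index_batch(batch)

def stream(tweets):
    # Main thread: keep consuming the stream, hand off full batches.
    worker = threading.Thread(target=indexer)
    worker.start()
    batch = []
    for tweet in tweets:
        batch.append(tweet)
        if len(batch) >= BATCH_SIZE:
            batch_queue.put(batch)
            batch = []
    if batch:
        batch_queue.put(batch)  # flush the final partial batch
    batch_queue.put(None)
    worker.join()
```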
Sounds good.
If you are using Java, you could also look at the river code.
Note that you should use the BulkProcessor class, which is super handy.
BTW, I said 10,000/s, but not for tweets: I had fewer fields (20) than Twitter has (>100).
With more fields, I guess it would take more time, though with better machines it could work. I'd say you need to test on your production cluster.
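BulkProcessor flushes a buffered batch whenever either a document-count or a byte-size threshold is crossed (its setBulkActions/setBulkSize settings). The semantics look roughly like this — a self-contained Python sketch of the idea, not the Java API:

```python
import json

class BulkBuffer:
    """Flush when either max_actions docs or max_bytes of payload accumulate,
    mimicking BulkProcessor's count and size thresholds."""

    def __init__(self, flush, max_actions=1000, max_bytes=5 * 1024 * 1024):
        self.flush_fn = flush
        self.max_actions = max_actions
        self.max_bytes = max_bytes
        self.docs, self.nbytes = [], 0

    def add(self, doc):
        self.docs.append(doc)
        self.nbytes += len(json.dumps(doc))
        if len(self.docs) >= self.max_actions or self.nbytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.docs:
            self.flush_fn(self.docs)
            self.docs, self.nbytes = [], 0

# Usage: collect flushed batches; flush() at the end drains the partial batch.
flushes = []
buf = BulkBuffer(flushes.append, max_actions=3)
for i in range(7):
    buf.add({"id": i})
buf.flush()
```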
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs