ElasticSearch bulk api performance


(allwefantasy) #1

Here is my code using the bulk API:
https://gist.github.com/1071233

Single node (node info and performance log):
https://gist.github.com/1071237

I find it really slow: 20 documents per second! If I start two nodes on different machines,
it drops to only 9 documents per second.
Does anyone know why?


(imarcticblue) #2

How large are your documents and how many are you indexing at a time? We're not using a DataItem, just raw JSON, and we can index 40M records in about 3:30 at 3200 docs/sec. Average record size is 0.5 KB. This is on AWS with 2 large nodes, 32 shards, 2 replicas. Your hardware looks beefier than what we have on AWS.

  • Craig
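
As a quick sanity check on the figures in this reply, the quoted rate and record count can be turned into an elapsed time. This is only arithmetic on the numbers above; reading "3:30" as hours and minutes is an assumption.

```python
# Figures quoted in the reply above.
records = 40_000_000   # "40M records"
docs_per_sec = 3200    # "3200 docs/sec"

# Elapsed time implied by a constant indexing rate.
elapsed_sec = records / docs_per_sec
hours, rem = divmod(int(elapsed_sec), 3600)
minutes, seconds = divmod(rem, 60)

print(f"{elapsed_sec:.0f} s = {hours} h {minutes} min {seconds} s")
# prints "12500 s = 3 h 28 min 20 s"
```

That is consistent with "about 3:30" if the figure means three and a half hours.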

(jrawlings) #3

Are you indexing to a local ES node? How large are the DataItems?

From my experience doing bulk prepares to a non-local ES node, I
noticed I maxed out my network connection (10mb) quite quickly.

--
View this message in context:http://elasticsearch-users.115913.n3.nabble.com/ElasticSearch-bulk-ap...
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


(allwefantasy) #4

The DataItem class contains id and source fields; source is the raw JSON data of a blog article. 3200 docs/sec is really awesome! I still have no idea why it is so slow in my application.


(allwefantasy) #5

Yes, I index to a local node. Each DataItem contains one blog article, and each time I bulk index 1000 DataItems.



(allwefantasy) #6

6 million blog articles will be indexed, and the index files total 62 GB.
1000 DataItems are indexed at a time.
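
A rough estimate of what each bulk request carries, using only the numbers in this thread. It assumes the 62 GB is spread evenly over the 6 million articles and that "G" means 10^9 bytes; on-disk index size and raw source size can differ, so treat the result as an order of magnitude only.

```python
# Numbers taken from the thread; the unit interpretation is an assumption.
total_bytes = 62 * 10**9   # "62G", assumed decimal gigabytes
articles = 6_000_000       # 6 million blog articles
batch_size = 1000          # DataItems per bulk request
observed_rate = 20         # docs/sec reported in the first post

avg_doc_bytes = total_bytes / articles          # average size of one article
batch_bytes = avg_doc_bytes * batch_size        # payload of one bulk request
bytes_per_sec = avg_doc_bytes * observed_rate   # data rate at the observed speed

print(f"average document: {avg_doc_bytes / 1000:.1f} kB")
print(f"one bulk request: {batch_bytes / 10**6:.1f} MB")
print(f"data rate at {observed_rate} docs/sec: {bytes_per_sec / 1000:.0f} kB/s")
```

Under these assumptions each article is roughly 10 kB, so a request of 1000 DataItems is on the order of 10 MB.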



(Craig Brown) #7

So your docs are about 10K each? Are you doing any other kind of
transformation on your data? The code you showed is virtually the same
as mine, but I'm not using a DataItem. My data files are JSON, one document per
row. I simply read each row, set the id and source, then use
client.prepareIndex(). We index 10,000 docs at a time. The files
contain 20M docs and are compressed using gz. It's basically
just as fast to read from compressed files, plus you get much smaller
files to push around :)

  • Craig
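
The read-and-batch loop described in this reply could look roughly like the sketch below. It is written in Python for brevity rather than with the Java client used in the thread, and it is not the poster's code. The `send_bulk` callback is a hypothetical stand-in for building and executing one bulk request, and taking the id from an `id` field in each JSON row is an assumption, since the reply does not say where the id comes from.

```python
import gzip
import json
from typing import Callable, Iterator, List, Tuple

Doc = Tuple[str, str]  # (id, raw JSON source)


def read_docs(path: str, id_field: str = "id") -> Iterator[Doc]:
    """Yield (id, source) pairs from a gzip file holding one JSON document per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: the document id is stored in the JSON itself.
            doc_id = str(json.loads(line)[id_field])
            yield doc_id, line  # the source is passed through as raw JSON


def index_in_batches(
    docs: Iterator[Doc],
    send_bulk: Callable[[List[Doc]], None],
    batch_size: int = 10_000,
) -> int:
    """Group documents into batches and hand each batch to send_bulk.

    Returns the total number of documents handed over.
    """
    batch: List[Doc] = []
    total = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            send_bulk(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        send_bulk(batch)
        total += len(batch)
    return total
```

A call such as `index_in_batches(read_docs("articles.json.gz"), send_bulk)` (file name hypothetical) streams the file once and never holds more than one batch in memory.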


