High-Volume Indexing of Documents

Hi,

We have an application that generates around 7000-10000 JSON messages
per second. Each message is around 2.6 KB. What best practices should
be followed at the Java API level so that both my application and
ElasticSearch scale well?
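For a rough sense of scale, the figures above can be turned into a raw ingest rate. This is only a back-of-the-envelope sketch: it assumes 1 KB = 1000 bytes (so 2.6 KB = 2600 bytes), counts raw JSON only, and ignores index overhead and replicas. The class and method names are made up for illustration.

```java
// Back-of-the-envelope sizing for the message rates quoted in the post.
// Assumption (not from the post): 1 KB = 1000 bytes, so 2.6 KB = 2600 bytes.
final class IngestSizing {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Raw JSON bytes arriving per second. */
    static long bytesPerSecond(long messagesPerSecond, long bytesPerMessage) {
        return messagesPerSecond * bytesPerMessage;
    }

    /** Raw JSON bytes arriving per day at a sustained rate. */
    static long bytesPerDay(long messagesPerSecond, long bytesPerMessage) {
        return bytesPerSecond(messagesPerSecond, bytesPerMessage) * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long size = 2_600;
        System.out.println(bytesPerSecond(7_000, size));  // 18200000 (18.2 MB/s)
        System.out.println(bytesPerSecond(10_000, size)); // 26000000 (26 MB/s)
        System.out.println(bytesPerDay(10_000, size));    // 2246400000000 (about 2.25 TB/day)
    }
}
```

So the sustained input is roughly 18 to 26 MB/s of raw JSON, which is worth keeping in mind when sizing disks and heap.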

Right now my application and ElasticSearch reside on the same box. I
intend to use the ElasticSearch Java client through a node of type
client, as suggested in the documentation here: http://www.elasticsearch.org/guide/reference/java-api/client.html.
Since my application is multithreaded, I will share the client among
its threads. Is that OK?

For a high volume of writes to ElasticSearch, is using the Bulk API better?

Please suggest any other best practices I can include in my
implementation. I would like to scale to a 13-node cluster soon.

Regards,
Meetu Maltiar

--

Hi,

Bulk is indeed good. -Xmx (the JVM heap size) and the other JVM settings
matter. If this is relatively write-heavy, look at the index merge
parameters. The refresh interval can, and should, be high unless you
really need near-real-time (NRT) search.
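To illustrate the batching idea behind the Bulk API, here is a minimal sketch in plain Java. It is not ElasticSearch code: `BatchingIndexer` and its `bulkSink` are hypothetical names, and the sink stands in for whatever would send one bulk request per batch through the shared client. The batch size of 1000 is an arbitrary assumption to be tuned by measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Collects documents and hands them to a sink in batches.
 * The sink is a hypothetical stand-in for code that sends one bulk request.
 */
final class BatchingIndexer {
    private final int batchSize;
    private final Consumer<List<String>> bulkSink;
    private List<String> buffer = new ArrayList<>();

    BatchingIndexer(int batchSize, Consumer<List<String>> bulkSink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.bulkSink = bulkSink;
    }

    /** Adds one JSON document and flushes once the batch is full. Callable from several threads. */
    synchronized void add(String jsonDocument) {
        buffer.add(jsonDocument);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered, for example on a timer or at shutdown. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = buffer;
        buffer = new ArrayList<>();
        bulkSink.accept(batch); // runs while holding the lock: simple, but serializes senders
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        BatchingIndexer indexer = new BatchingIndexer(1000, batch -> sizes.add(batch.size()));
        for (int i = 0; i < 2500; i++) {
            indexer.add("{\"id\":" + i + "}");
        }
        indexer.flush(); // push the final partial batch
        System.out.println(sizes); // [1000, 1000, 500]
    }
}
```

The point is only that 2500 documents become three requests instead of 2500. A real implementation would also want a time-based flush so a slow trickle of documents does not sit in the buffer.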

It may be best to wait until you hit issues, if you do. Then you can
provide concrete info about what you are doing and others can give
feedback.

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

(In reply to Meetu Maltiar's message of Tuesday, December 18, 2012, 10:43:12 PM UTC-5.)

--

Thanks a lot Otis,

I am going with your suggestion of using the Bulk API. I will look at
a) the JVM settings, b) the index merge parameters, and c) the refresh interval.

Right now I have a singleton node and a "node-client" that is
shared by all threads. Is this fine? I am trying to avoid creating a
client on each call; otherwise I would have to create a client for
every document to be indexed.

Regards,
Meetu Maltiar

(In reply to Otis Gospodnetic's message of Dec 19, 10:17 am.)

--

Yes. Share the same client across all threads.
BTW, if you are using Spring, you can have a look at this: https://github.com/dadoonet/spring-elasticsearch
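One way to share a single client across threads is a small lazy holder. The sketch below is plain Java and makes no ElasticSearch calls: `SharedClient` is a hypothetical name, `T` stands in for the client type, and the factory is a placeholder for whatever creates the node client once.

```java
import java.util.function.Supplier;

/**
 * Lazily creates one instance and hands the same one to every caller.
 * T stands in for the client type; the factory is a hypothetical placeholder.
 */
final class SharedClient<T> {
    private final Supplier<T> factory;
    private volatile T instance; // volatile so other threads see the fully built object

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the single shared instance, creating it on first use. */
    T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance; // re-check under the lock
                if (result == null) {
                    result = factory.get();
                    instance = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedClient<Object> shared = new SharedClient<>(Object::new);
        Object first = shared.get();
        Thread worker = new Thread(() -> System.out.println(shared.get() == first)); // true
        worker.start();
        worker.join();
    }
}
```

Each worker calls `get()` and receives the same object, so the expensive creation happens once rather than per document. Closing the client at shutdown is left out of the sketch.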

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

(In reply to Meetu Maltiar's message of 19 Dec 2012, 07:15.)

--

Thanks David,

Nice, I will share the client across threads. BTW, I am using Scala as
the language, the ElasticSearch Java API, and Akka for parallelizing things.
I am not using Spring at the moment, but may do so after some time.
Thanks for the GitHub link, it looks great to use :-)

Meetu Maltiar
Twitter: @meetumaltiar

(In reply to David Pilato's message of Dec 19, 11:32 am.)

--