Initial load and cluster configuration query

Hi All,
I have the following questions about Elasticsearch:

  1. In our application we would like to index around 2-3 million documents.
    What would be the best cluster/shard configuration?
  2. Our initial load consists of indexing 2-3 million records. What
    configuration should we apply in Elasticsearch to achieve faster
    indexing for the initial load?
  3. Currently we have a batch process that spawns 10 threads, and these
    threads send requests to the Elasticsearch server. Do we foresee any
    issues with this approach (like a write lock)?

Thanks for your help.

Regards,
Lalit.

On Thursday, May 19, 2011 at 8:32 PM, lalit mishra wrote:
Hi All,

I have the following questions about Elasticsearch:

  1. In our application we would like to index around 2-3 million documents. What would be the best cluster/shard configuration?

Hard to tell, since I don't know the size of the documents. You will need to do some capacity planning. Generally, the default 5 shards should be more than enough for 2-3 million documents, and depending on the documents, you can even use a lower number of shards (which will use less memory).
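As a rough sketch of the lower-shard option, the shard count can be set when the index is created. The index name `test`, the values 2 and 1, and the `SEND` guard variable are assumptions for illustration only; real values should come out of capacity planning.

```shell
# Hypothetical values: index "test", 2 primary shards, 1 replica per primary.
INDEX="test"
BODY='{"settings": {"index": {"number_of_shards": 2, "number_of_replicas": 1}}}'

# Print the request so it can be reviewed before anything is sent.
echo "PUT localhost:9200/$INDEX $BODY"

# Send it only when SEND=1 is set (assumes a node listening on localhost:9200).
if [ "${SEND:-0}" = "1" ]; then
  curl -XPUT "localhost:9200/$INDEX" -H 'Content-Type: application/json' -d "$BODY"
fi
```

Run it once as-is to inspect the request, then again with `SEND=1` to actually create the index.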

  2. Our initial load consists of indexing 2-3 million records. What configuration should we apply in Elasticsearch to achieve faster indexing for the initial load?

Indexing rate really depends on many factors. As for scaling out, the default 5 shards set for an index mean you can grow up to 5 machines. If you have 1 replica (the default), then you can grow up to 10 machines without hitting a wall.
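The arithmetic behind those machine counts can be sketched as follows: each primary shard has one extra copy per replica, and every shard copy can be placed on its own machine. The variable names are illustrative, not Elasticsearch settings.

```shell
shards=5      # default primary shards per index, as stated above
replicas=1    # default number of replicas per primary

# Without replicas, at most one machine per primary shard.
max_machines_primaries=$shards

# With replicas, every primary and every replica copy can sit on its own machine.
max_machines_total=$(( shards * (1 + replicas) ))

echo "primaries only: $max_machines_primaries, with replicas: $max_machines_total"
```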

  3. Currently we have a batch process that spawns 10 threads, and these threads send requests to the Elasticsearch server. Do we foresee any issues with this approach (like a write lock)?

No, no issues here. Check that your indexing machine is not the bottleneck.

  2. Our initial load consists of indexing 2-3 million records. What
    configuration should we apply in Elasticsearch to achieve faster
    indexing for the initial load?

Hopefully I'm not wrong about this, but you can increase the real-time
latency (refresh_interval), which improves indexing speed. @Shay: Is that a
correct assumption?

Also, increasing Lucene's merge factor can improve indexing
speed (but you pay for it at query time).

You can change those in real time. Check the update settings API documentation, which has a sample for it.

On Friday, May 20, 2011 at 10:46 PM, Karussell wrote:

  2. Our initial load consists of indexing 2-3 million records. What
    configuration should we apply in Elasticsearch to achieve faster
    indexing for the initial load?

Hopefully I'm not wrong about this, but you can increase the real-time
latency (refresh_interval), which improves indexing speed. @Shay: Is that a
correct assumption?

Also, increasing Lucene's merge factor can improve indexing
speed (but you pay for it at query time).

Hi Shay,
So I need to use the configuration below during the initial load of the records:

curl -XPUT localhost:9200/test/_settings -d '{
    "index" : {
        "refresh_interval" : "-1",
        "merge.policy.merge_factor" : 30
    }
}'

Once the initial load is over, revert the index settings to:

curl -XPUT localhost:9200/test/_settings -d '{
    "index" : {
        "refresh_interval" : "1s",
        "merge.policy.merge_factor" : 10
    }
}'

curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'

Hopefully this will further speed up the process.

Thanks,
Lalit.

On Sat, May 21, 2011 at 1:22 AM, Shay Banon shay.banon@elasticsearch.com wrote:

You can change those in real time. Check the update settings API
documentation, which has a sample for it.
