Optimizing number of shards

We are indexing about 3M documents per day (20 KB per document), kept for a
total of 90 days with 1 replica. Each day gets a new index, giving 90 indexes
in all. Each index uses a custom mapping, and that mapping is the same for
all 90 indexes.

Given the above scenario, what would be an optimum number of shards
and any other setting that should be considered for a 5-node ES setup?

-anurag

I would use only one shard per index IMHO (90 indices on 5 nodes already
seems to be a high load ;)).

But check if indexing speed is enough...
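
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It uses only the figures from the question (3M documents/day, 20 KB each, 90 daily indexes, 1 replica, 5 nodes). The 5-shard case is included purely for comparison, and the byte counts are raw document sizes, which is an assumption: the actual on-disk index size will differ.

```python
# Rough capacity arithmetic for the setup described in the question.
# Raw document size is used as a stand-in for index size (an assumption).

DOCS_PER_DAY = 3_000_000
DOC_SIZE_KB = 20
RETENTION_DAYS = 90      # one index per day
REPLICAS = 1
NODES = 5


def estimate(shards_per_index: int) -> dict:
    """Return shard counts and raw data volume for a given shards-per-index."""
    copies = 1 + REPLICAS                                # primary plus replicas
    total_shards = RETENTION_DAYS * shards_per_index * copies
    daily_gb = DOCS_PER_DAY * DOC_SIZE_KB / 1_000_000    # KB -> GB (decimal)
    total_gb = daily_gb * RETENTION_DAYS * copies
    return {
        "total_shards": total_shards,
        "shards_per_node": total_shards / NODES,
        "daily_gb": daily_gb,
        "total_gb": total_gb,
        "gb_per_node": total_gb / NODES,
    }


if __name__ == "__main__":
    for n in (1, 5):
        print(n, estimate(n))
```

With one shard per index this comes to 180 shard copies (36 per node) and about 60 GB of raw documents per day; with five shards per index it would be 900 shard copies.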

I don't see any latency while it's being indexed, so I guess that's okay.

On the same note, is there a way to specify a default number of shards for
the entire cluster?

Index-level settings allow one to specify it for a given index, but I don't
see a system-wide default like the one that exists for mappings.

-anurag

nm, found it:
{
  "index" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}

Check out index templates: they let the settings / mappings be applied automatically to each new index in this case.

Shay,
Does having a template affect performance? With the bulk index API, is
the template invoked every time a document is indexed, or is there some
sort of internal flag so that it is only invoked for new indices?

-anurag

The index template is applied only when the index gets created.
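
Since a template is matched against the index name when the index is created, the daily indices only need names that fit the template's pattern. Below is a minimal sketch, assuming a hypothetical naming scheme `socorro_YYYYMMDD` and the `socorro*` pattern used later in this thread. `fnmatchcase` merely imitates simple `*` wildcard matching here and is not a claim about how Elasticsearch implements it. The retention helper is likewise only an illustration of the 90-day window from the question.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "socorro*"   # pattern from the template later in the thread
INDEX_PREFIX = "socorro_"       # hypothetical daily naming scheme
RETENTION_DAYS = 90


def index_name(day: date) -> str:
    """Name of the daily index for a given day, e.g. socorro_20110710."""
    return INDEX_PREFIX + day.strftime("%Y%m%d")


def template_applies(name: str) -> bool:
    """Imitate simple wildcard matching of an index name against the pattern."""
    return fnmatchcase(name, TEMPLATE_PATTERN)


def expired_indices(existing: list[str], today: date) -> list[str]:
    """Return the daily indices that fall outside the retention window."""
    keep = {index_name(today - timedelta(days=i)) for i in range(RETENTION_DAYS)}
    return sorted(n for n in existing
                  if n.startswith(INDEX_PREFIX) and n not in keep)


if __name__ == "__main__":
    today = date(2011, 7, 10)
    names = [index_name(today - timedelta(days=i)) for i in range(92)]
    print(template_applies(names[0]))   # checked once, at index creation
    print(expired_indices(names, today))
```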

The following returns:
{"error":"ClassCastException[java.util.ArrayList cannot be cast to java.util.Map]","status":500}

curl -XPUT localhost:9200/_template/template_1 -d '
{
  "template" : "socorro*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "date_formats": [
      "yyyy-MM-dd HH:mm:ss.SSSSSS"
    ],
    "properties": {
      "completeddatetime": {
        "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
        "type": "date"
      },
      "date_processed": {
        "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
        "type": "date"
      },
      "client_crash_date": {
        "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
        "type": "date"
      },
      "build_date": {
        "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
        "type": "date"
      },
      "startedDateTime": {
        "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
        "type": "date"
      },
      "signature": {
        "type": "multi_field",
        "fields": {
          "signature": {
            "type": "string",
            "index": "analyzed"
          },
          "full": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
'

nm, didn't realize it needs a type, the following works:

curl -XPUT localhost:9200/_template/template_1 -d '
{
  "template" : "socorro*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "crash_reports": {
      "date_formats": [
        "yyyy-MM-dd HH:mm:ss.SSSSSS"
      ],
      "properties": {
        "completeddatetime": {
          "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
          "type": "date"
        },
        "date_processed": {
          "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
          "type": "date"
        },
        "client_crash_date": {
          "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
          "type": "date"
        },
        "build_date": {
          "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
          "type": "date"
        },
        "startedDateTime": {
          "format": "yyyy-MM-dd HH:mm:ss.SSSSSS",
          "type": "date"
        },
        "signature": {
          "type": "multi_field",
          "fields": {
            "signature": {
              "type": "string",
              "index": "analyzed"
            },
            "full": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
'
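
The difference between the two requests is the extra level under `mappings`: each key there is read as a type name, and its value has to be a JSON object. In the failing request `date_formats` sits directly under `mappings`, so its array value ends up where an object is expected, which fits the ArrayList-to-Map cast error. A small pre-flight check along those lines, as a sketch only: `check_template` is a hypothetical helper, not part of any Elasticsearch client, and the bodies are abbreviated versions of the ones above.

```python
import json


def check_template(body: dict) -> list[str]:
    """Return a list of problems found under "mappings" (empty if none)."""
    problems = []
    for type_name, type_mapping in body.get("mappings", {}).items():
        if not isinstance(type_mapping, dict):
            problems.append(
                f'"{type_name}" is a {type(type_mapping).__name__}, '
                "expected an object; is the type-name level missing?"
            )
    return problems


DATE = {"type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSSSSS"}
TYPE_MAPPING = {
    "date_formats": ["yyyy-MM-dd HH:mm:ss.SSSSSS"],
    "properties": {"date_processed": DATE, "build_date": DATE},
}

# Mapping placed directly under "mappings" (the request that failed).
failing = {"template": "socorro*", "settings": {"number_of_shards": 1},
           "mappings": TYPE_MAPPING}
# Mapping nested under a type name (the request that worked).
working = {"template": "socorro*", "settings": {"number_of_shards": 1},
           "mappings": {"crash_reports": TYPE_MAPPING}}

if __name__ == "__main__":
    print(check_template(failing))
    print(check_template(working))
    print(json.dumps(working, indent=2))  # body that could be passed to curl -d
```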

Yuck!, what a fugly failure message, pushed a nicer one to master.
