Bulk load performance

Hello,

I'm trying to bulk load ~10M JSON docs (12.8 GB) containing some
geographical information into an Elasticsearch index. With our current
params, the load takes around 20-25 minutes, but we think it
should be faster. Are these numbers similar to what other users are
getting? Do you have any hints on how to get better performance? Any help
will be appreciated. Please find the details below.

Our ES cluster is version 1.1.1 with 11 nodes, and we are using the
Elasticsearch-MapReduce libraries 2.0.2 to do the bulk load, setting the
number of reducers to 11. Other params we use are:

es.input.json=true
es.mapping.id=id
es.batch.size.bytes=10M
es.batch.size.entries=10000

The average doc size is 1.3 KB, and each doc contains a "bbox" field with
the shape definition like this:

"bbox": {
  "type": "envelope",
  "coordinates": [
    [
      -77.08488844489459,
      38.9502995339637
    ],
    [
      -77.0844224567727,
      38.9502305534064
    ]
  ]
}

We are using the following mapping for this index, because these are the 3
fields of our docs we are most interested in:

{
  "properties": {
    "bbox": {
      "precision": "10m",
      "tree": "quadtree",
      "type": "geo_shape"
    },
    "id": {
      "type": "string",
      "index": "not_analyzed"
    },
    "streets": {
      "type": "string"
    }
  }
}

This is a typical output of the MapReduce job:

14/11/17 09:05:44 INFO mapred.JobClient: Elasticsearch Hadoop Counters
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries Total Time(ms)=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total=1375
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total Time(ms)=11714959
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Accepted=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Received=5498829
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Sent=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Documents Accepted=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Documents Received=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Sent=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Network Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Network Total Time(ms)=11732552
14/11/17 09:05:44 INFO mapred.JobClient: Node Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total Time(ms)=0
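
To make the counters above easier to read, here is a small sketch that derives per-request averages from them. The numbers are copied from the job output; the function name is made up for illustration. The per-reducer estimate assumes that "Bulk Total Time(ms)" is summed over all tasks and that the 11 reducers do equal work in parallel, neither of which the output confirms.

```python
# Counter values copied from the job output above.
BULK_TOTAL = 1375
BULK_TOTAL_TIME_MS = 11714959
BYTES_SENT = 14351811146
DOCS_SENT = 10129699
REDUCERS = 11  # number of reducers stated in the job description


def summarize_counters():
    """Derive simple averages from the Elasticsearch Hadoop counters."""
    return {
        # Average number of documents in one bulk request.
        "docs_per_bulk": DOCS_SENT / BULK_TOTAL,
        # Average payload size of one bulk request, in bytes.
        "bytes_per_bulk": BYTES_SENT / BULK_TOTAL,
        # Average size of one document as sent, in bytes.
        "bytes_per_doc": BYTES_SENT / DOCS_SENT,
        # Average time spent in one bulk request, in milliseconds.
        "ms_per_bulk": BULK_TOTAL_TIME_MS / BULK_TOTAL,
        # ASSUMPTION: the time counter is a sum over tasks and the
        # reducers run evenly in parallel.
        "bulk_minutes_per_reducer": BULK_TOTAL_TIME_MS / REDUCERS / 60000,
    }


if __name__ == "__main__":
    for name, value in summarize_counters().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, each bulk request carries about 7,367 documents and about 10.4 MB, and takes about 8.5 seconds. The request size sits close to the configured es.batch.size.bytes while the document count stays below es.batch.size.entries, which suggests the byte limit is what flushes each batch. The estimated 17.7 minutes of bulk time per reducer accounts for most of the 20-25 minute runtime, so the time appears to be spent waiting on Elasticsearch and not in the Hadoop side of the job.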

Thanks,
Xavier.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/70956234-78d0-4ee2-9536-398ac529b76a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On the index settings side, you can dynamically turn off the index
refresh_interval and also reduce the number of shard replicas for the
duration of the bulk import.

Described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
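
As a rough sketch of that suggestion, the snippet below builds the two update-settings requests: one before the import (refresh disabled, no replicas) and one after it. The `index.refresh_interval` and `index.number_of_replicas` settings and the `PUT /{index}/_settings` endpoint are Elasticsearch's own; the host, the index name "geo", the function names, and the restored values ("1s", 1 replica) are placeholders that would need to match the real cluster.

```python
import json
import urllib.request


def bulk_load_settings():
    """Settings to apply before the import: no refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to apply after the import (values are placeholders)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


def build_settings_request(base_url, index, settings):
    """Build (but do not send) a PUT request to the index _settings endpoint."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host and index name; nothing is sent to a cluster here.
    before = build_settings_request("http://localhost:9200", "geo", bulk_load_settings())
    after = build_settings_request("http://localhost:9200", "geo", restore_settings())
    for req in (before, after):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # To apply for real: urllib.request.urlopen(req)
```

The first request would go out before the MapReduce job starts and the second once it has finished.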

(In reply to the original message of Wed, Nov 19, 2014, 2:53 AM.)

--
Nick Canzoneri
Developer, Wildbit http://wildbit.com/
Beanstalk http://beanstalkapp.com/, Postmark http://postmarkapp.com/,
dploy.io

Thank you Nick, I tried that but I didn't see a noticeable performance
improvement.

Also, I tried setting the number of replicas to "0", loading the data, and
then setting it back to "5", but this is causing some problems with our
health check scripts: because the index is very large, the shards seem to
stay in "INITIALIZING" status forever.

Regards.
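
With 5 replicas, every shard has to be copied five more times after the load, so a long INITIALIZING phase on a large index is expected. One possible way for a health check script to cope is to block on the cluster health API until recovery finishes. The sketch below assumes the standard `_cluster/health` endpoint with its `wait_for_status` and `timeout` parameters; the host, the index name, and the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def health_url(base_url, index, status="green", timeout="60s"):
    """URL that asks Elasticsearch to wait until the index reaches a status."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health/{quote(index)}?{query}"


def recovery_finished(health):
    """True if the health response shows no pending shard copies."""
    return (
        not health.get("timed_out", True)
        and health.get("initializing_shards", 1) == 0
        and health.get("unassigned_shards", 1) == 0
    )


def wait_for_index(base_url, index):
    """Query the health API once; needs a reachable cluster."""
    with urllib.request.urlopen(health_url(base_url, index), timeout=90) as resp:
        return recovery_finished(json.load(resp))


if __name__ == "__main__":
    # Hypothetical host and index name; only the URL is printed here.
    print(health_url("http://localhost:9200", "geo"))
```

A script could call `wait_for_index` in a loop and only report a failure once an overall deadline has passed, instead of alerting on the first INITIALIZING shard.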

(In reply to Nick Canzoneri's message of Wednesday, November 19, 2014, 7:47:10 AM UTC-8.)