Hello,
I'm trying to bulk-load ~10M JSON docs (12.8 GB) containing geographical
information into an Elasticsearch index. With our current parameters, the
load takes around 20-25 minutes, but we think it should be faster. Are
these numbers similar to what other users are getting? Do you have any
hints on how to get better performance? Any help will be appreciated.
Please find the details below.
Our ES cluster is version 1.1.1 with 11 nodes, and we are using the
Elasticsearch-MapReduce libraries 2.0.2 to do the bulk load, with the
number of reducers set to 11. The other params we use are:
es.input.json=true
es.mapping.id=id
es.batch.size.bytes=10M
es.batch.size.entries=10000
The average doc size is 1.3 KB, and each doc contains a "bbox" field with
the shape definition, like this:
"bbox": {
  "type": "envelope",
  "coordinates": [
    [
      -77.08488844489459,
      38.9502995339637
    ],
    [
      -77.0844224567727,
      38.9502305534064
    ]
  ]
}
We are using the following mapping for this index, because these are the 3
fields of our docs we are most interested in:
{
  "properties": {
    "bbox": {
      "precision": "10m",
      "tree": "quadtree",
      "type": "geo_shape"
    },
    "id": {
      "type": "string",
      "index": "not_analyzed"
    },
    "streets": {
      "type": "string"
    }
  }
}
This is a typical output of the MapReduce job:
14/11/17 09:05:44 INFO mapred.JobClient: Elasticsearch Hadoop Counters
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries Total Time(ms)=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total=1375
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total Time(ms)=11714959
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Accepted=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Received=5498829
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Sent=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Documents Accepted=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Documents Received=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Sent=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Network Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Network Total Time(ms)=11732552
14/11/17 09:05:44 INFO mapred.JobClient: Node Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total Time(ms)=0
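To make the counters above easier to read, here is a rough sketch in Python that derives per-request averages from them. The counter values are copied from the job output; the variable names are only illustrative. The per-reducer figure assumes the "Bulk Total Time" counter is summed over all 11 reducers and that the reducers run fully in parallel, which I have not verified.

```python
# Counter values copied from the MapReduce job output above.
bulk_total = 1375                # number of bulk requests sent
bulk_total_time_ms = 11714959    # time spent in bulk requests (ms)
bytes_sent = 14351811146         # bytes sent to Elasticsearch
docs_sent = 10129699             # documents sent to Elasticsearch
reducers = 11                    # number of reducers configured for the job

# Averages per bulk request.
docs_per_bulk = docs_sent / bulk_total
bytes_per_bulk = bytes_sent / bulk_total
avg_bulk_ms = bulk_total_time_ms / bulk_total

# Average payload size per document, as sent over the wire.
bytes_per_doc = bytes_sent / docs_sent

# ASSUMPTION: the time counter is a sum across reducers that ran in parallel,
# so dividing by the reducer count approximates the time each one spent.
bulk_minutes_per_reducer = bulk_total_time_ms / reducers / 60000

print(f"docs per bulk request:    {docs_per_bulk:.0f}")
print(f"MB per bulk request:      {bytes_per_bulk / 1e6:.2f}")
print(f"avg bulk request time:    {avg_bulk_ms / 1000:.1f} s")
print(f"bytes per doc:            {bytes_per_doc:.0f}")
print(f"bulk minutes per reducer: {bulk_minutes_per_reducer:.1f}")
```

With these numbers, each bulk request averages about 7,400 docs and roughly 10.4 MB, so the es.batch.size.bytes limit seems to be reached before es.batch.size.entries, and each request takes around 8.5 seconds on average.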
Thanks,
Xavier.