Geo_shape indexing speed

Hello,

I need to index lots of documents in the following format:

source: {
  "geobox": {
    "type": "envelope",
    "coordinates": [[-78.538, -0.363], [-78.53, -0.371]]
  },
  "country": "EC",
  "name": "-0.363_-78.538_-0.371_-78.53",
  "population": 231
}

My type is like this:

"cell" : {
"properties" : {
u"name": {'index': 'not_analyzed','type': 'string',},
u"country": {'index': 'not_analyzed','type': 'string',},
u"population": {'type': 'integer',},
u"geobox": {'type': 'geo_shape', "tree": "geohash",
"precision": "1m"},
}
}
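
For reference, here is a minimal sketch of creating an index with this mapping over plain HTTP; the index name "cells" and the localhost URL are placeholders, and any HTTP client would do:

import json

import requests  # assumed available; any HTTP client works

mapping = {
    "mappings": {
        "cell": {
            "properties": {
                "name": {"index": "not_analyzed", "type": "string"},
                "country": {"index": "not_analyzed", "type": "string"},
                "population": {"type": "integer"},
                "geobox": {"type": "geo_shape", "tree": "geohash",
                           "precision": "1m"},
            }
        }
    }
}

# Create the index and its mapping in one PUT; "cells" is a placeholder name.
resp = requests.put("http://localhost:9200/cells", data=json.dumps(mapping))
print(resp.status_code, resp.text)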

I tried many of the tuning suggestions I found, including increasing the JVM
heap to 16GB and setting mlockall: true...

None of my tuning changes seems to produce any dramatic improvement in
indexing performance: indexing 500 documents always takes somewhere between
30 and 60 seconds...

I am indexing on one VM that is part of a 4-node cluster. The index has 4
shards and zero replicas. Note that even going from one node to four did not
change the indexing speed much!

I am using Python and PyES (over Thrift) for indexing.

Can I hope for much better performance somehow? What would you suggest? Or is
this about as good as geo_shape indexing gets?

Thanks a lot,
Mohamed.

Hi Mohamed,

How are you indexing those documents? Are you sending them using the bulk
API? Are you using threads or processes to send more in parallel? Are you
always talking to just one node?

The fact that you don't see increased performance with more nodes might
suggest that the bottleneck is actually your Python process. Try running
the indexing in parallel to determine that.
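
A minimal sketch of that approach, assuming the documents already exist one
JSON object per line in a file, and talking to the plain HTTP _bulk endpoint
rather than PyES (the node URLs, the index name "cells", and the file name
are placeholders):

import itertools
import json
from multiprocessing import Pool

import requests  # any HTTP client works

# Placeholders: point these at your own nodes so the load is spread.
NODES = ["http://node1:9200", "http://node2:9200",
         "http://node3:9200", "http://node4:9200"]
BULK_SIZE = 5000

def send_bulk(args):
    batch_no, lines = args
    node = NODES[batch_no % len(NODES)]
    # A bulk body is newline-delimited: one action line, then one source line.
    action = json.dumps({"index": {"_index": "cells", "_type": "cell"}})
    body = "\n".join(itertools.chain.from_iterable(
        (action, line.rstrip("\n")) for line in lines)) + "\n"
    return requests.post(node + "/_bulk", data=body).status_code

def batches(lines, size):
    while True:
        batch = list(itertools.islice(lines, size))
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    with open("docs.json") as f, Pool(processes=4) as pool:
        for status in pool.imap(send_bulk, enumerate(batches(f, BULK_SIZE))):
            print("bulk response:", status)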

Hi Honza,

Yes, I am using the bulk API, with a bulk_size of 5000. I tried other
numbers, but 5k seemed to give the best results.

I will rewrite my Python to use multiprocessing and see if I can get more
performance.

And yes, I am always talking to just one node... I will test running four
Python instances, one on each node.

Mohamed.

Using all four nodes for indexing, instead of one, improved the overall
results, but not by much. Each node is taking somewhere between 60 and 100
seconds to index 5,000 docs, instead of 30 to 60 seconds (roughly 50 to 80
docs per second per node).
In my first post I said 500 documents, but that was a mistake; it is 5,000,
which is my PyES bulk_size.

Example script log:

2013/07/19 14:15:23: Checkpoint 45000 Time: [73.04 sec]
2013/07/19 14:16:35: Checkpoint 50000 Time: [69.91 sec]
2013/07/19 14:17:55: Checkpoint 55000 Time: [76.86 sec]
2013/07/19 14:19:14: Checkpoint 60000 Time: [76.40 sec]
2013/07/19 14:20:41: Checkpoint 65000 Time: [84.90 sec]
2013/07/19 14:21:59: Checkpoint 70000 Time: [74.73 sec]

Do these numbers sound too slow? Any other suggestions?

Also, my Python script is simply reading pre-existing JSON strings from a
file... Would a bash script using curl to push the documents to ES be
faster? I have no idea where the bottleneck really is!
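
One way to find out would be to time the two halves separately: payload
assembly on the client versus the HTTP round trip to the cluster. A rough
sketch, reusing the placeholder index "cells" and file from above; if the
POST dominates, the cluster is the bottleneck, and if assembly dominates,
the client is:

import itertools
import json
import time

import requests  # any HTTP client works

action = json.dumps({"index": {"_index": "cells", "_type": "cell"}})

with open("docs.json") as f:
    while True:
        t0 = time.time()
        batch = list(itertools.islice(f, 5000))
        if not batch:
            break
        body = "\n".join(itertools.chain.from_iterable(
            (action, line.rstrip("\n")) for line in batch)) + "\n"
        t1 = time.time()
        requests.post("http://localhost:9200/_bulk", data=body)
        t2 = time.time()
        print("assemble %.2fs  post %.2fs" % (t1 - t0, t2 - t1))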

Thanks a lot,
Mohamed.

Could this change dramatically affect indexing speed AND index size?

"geobox": {"type": "geo_shape", "tree": "geohash", "precision": "1m"},

to:

"geobox": {"type": "geo_shape", "tree": "geohash", "precision": "1km"},

I am trying it now, plus running 10 concurrent curl POSTs, and the process is
moving very fast (sorry, I am not collecting any metrics)...

Thanks,
Mohamed.

Absolutely. The precision has a huge impact on the amount of work ES needs to
do on the data; lower precision reduces the CPU load significantly.
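
A rough back-of-the-envelope sketch of why (this is not the actual geohash
algorithm, just the quadratic scaling behind it): the number of leaf cells
needed to cover a shape grows quadratically as the cell edge shrinks, and the
sample envelope is only about 890 m on a side:

import math

DEG_TO_M = 111320.0  # metres per degree, roughly, near the equator

def cells_to_cover(lon_span_deg, lat_span_deg, precision_m):
    # Approximate the envelope as a flat rectangle tiled by square cells.
    width_m = lon_span_deg * DEG_TO_M
    height_m = lat_span_deg * DEG_TO_M
    return math.ceil(width_m / precision_m) * math.ceil(height_m / precision_m)

# The sample envelope spans 0.008 deg x 0.008 deg.
for precision_m in (1.0, 1000.0):
    print("%6.0f m -> ~%d cells"
          % (precision_m, cells_to_cover(0.008, 0.008, precision_m)))
# Roughly 794,000 cells at 1 m versus 1 cell at 1 km: about six orders of
# magnitude fewer terms to generate and store per document.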

Excellent, thanks a lot for confirming. I think 1km precision will do just
fine for my needs.

Mohamed.

For the record:

Indexed 75 million such documents in about 4 hours (roughly 5,200 docs per
second), using some 65GB of disk space. I used 10 concurrent curl commands on
one node, part of a four-node cluster.
The cluster nodes happened to be 4 big VMs: two of them had 16 CPUs and 200GB
RAM, and two had 8 CPUs and 50GB RAM.
The JVM on all four nodes was configured with a max heap of 16GB.
