We're trying to improve the performance of bulk indexing in a
cluster. The problem is that dynamic mapping updates slow the
operations down as the index grows.
Setting refresh_interval to -1 doesn't help, as dynamic mapping
updates still go to the other cluster nodes.
Setting number_of_replicas to 0 doesn't help either, as shards still
get distributed across nodes, so dynamic mapping updates get slow.
So we wanted to move all shards to one node, as we saw that indexing
this data in a one-node setup is very fast.
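For readers following along, the two settings mentioned above could be applied with an index settings update along these lines. This is only a sketch: the host `http://localhost:9200`, the index name `myindex`, and the helper `build_settings_request` are assumptions made for illustration, and the request is built but not sent.

```python
import json
import urllib.request


def build_settings_request(base_url, index, settings):
    """Build (but do not send) a PUT request for the index settings endpoint.

    base_url and index are placeholders; adjust them to your cluster.
    """
    body = json.dumps({"index": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# The two settings discussed in the post: no periodic refresh, no replicas.
bulk_load_settings = {"refresh_interval": "-1", "number_of_replicas": 0}

# Hypothetical host and index name, used here only as an example.
req = build_settings_request("http://localhost:9200", "myindex", bulk_load_settings)

# To actually apply it: urllib.request.urlopen(req)
```

Both values would normally be restored once the bulk load has finished.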
I am also having issues with bulk indexing to a cluster. The system
slows to a crawl as the index gets larger. And just like you, I set the
number of replicas to zero and disabled the refresh interval. My
project is still in the exploration phase, so I am only using two nodes.
My question is: how were you able to determine that dynamic mapping
updates are the cause of your problems? I have not paid attention to
the network chatter between the two boxes, but I am wondering if I
should.
First, regarding the allocation, are you trying to filter based on the randomized/explicitly set node name? If so, you should use _name as the attribute, not name (i.e. routing.allocation.include._name).
Regarding bulk being slow, how did you come to the conclusion that updating the mapping slows it down? It's an async process (updating the mapping). Do you have a case where each new bulk item has new fields?
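As a rough illustration of the `_name` point, the settings body for pinning one index to a single node might look like the sketch below. The node name `node1` and the helper `allocation_filter_body` are hypothetical; the body would be sent as a PUT to that index's `_settings` endpoint, so other indices and nodes are not affected.

```python
import json


def allocation_filter_body(node_name):
    """Return the JSON body that restricts an index's shards to one node.

    Uses the special _name attribute (the node's name) rather than a
    custom attribute called "name". node_name is a placeholder.
    """
    settings = {"index.routing.allocation.include._name": node_name}
    return json.dumps(settings)


# "node1" is an example node name, not taken from the thread.
body = allocation_filter_body("node1")
print(body)
```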
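One way to answer the last question before looking at the cluster at all is to count how many previously unseen field names each bulk batch introduces. The sketch below does that on the client side; it is only a rough proxy for what dynamic mapping would do, and the function names and sample documents are made up for illustration.

```python
def flatten_keys(doc, prefix=""):
    """Yield dotted field paths for a (possibly nested) document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten_keys(value, path + ".")
        else:
            yield path


def new_fields_per_batch(batches):
    """Return, for each batch, the number of field paths not seen before."""
    seen = set()
    counts = []
    for batch in batches:
        fresh = set()
        for doc in batch:
            fresh.update(flatten_keys(doc))
        fresh -= seen          # keep only fields that are new in this batch
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Made-up sample data: three bulk batches of documents.
sample_batches = [
    [{"a": 1, "b": {"c": 2}}],
    [{"a": 3}],
    [{"a": 1, "d": 4}],
]
counts = new_fields_per_batch(sample_batches)
print(counts)
```

If the counts stay above zero batch after batch, the data keeps introducing new fields, and mapping updates are a plausible suspect. If they drop to zero quickly, the slowdown probably has another cause.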
On Thursday, February 16, 2012 at 12:20 AM, Jorge Urdaneta wrote:
Hi,
We got no replicas (as we requested), but shards are still distributed,
ignoring the setting for routing.allocation.include.name.
We need to do this for only one index. We noticed that the cluster API
also allows decommissioning specific nodes, but we need the other nodes
to continue working.