Hey all,
We're bumping up against a production problem I could use a hand with.
We're seeing steadily decreasing indexing speeds. We have 12 c3.4xl
data nodes and 1 c3.8xl master node (with 2 smaller backup masters).
We're indexing 45 million documents into a single index. Single shard
only, no replicas. As the number of documents grows, our indexing speed
slows to a crawl. We've applied all the standard mlockall, ulimit, and SSD
merge throttling tuning settings, so I believe the cluster is tuned well.
When I inspected the data, I noticed our user is adding a new field on
every document. When I view the pending tasks on our master, the task
queue always holds 300+ tasks attempting to perform dynamic mapping. I've
also checked segment merging: we never have more than 1 merge going on, and
even then it lasts only a second or two.
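For anyone wanting to reproduce the pending-task check, here is a minimal
sketch of how it could be scripted. It assumes the cluster exposes
GET /_cluster/pending_tasks over HTTP and that each task carries a "source"
string. The base URL, the helper names, and the sample response are
hypothetical, and the exact "source" labels vary by Elasticsearch version.

```python
import json
import urllib.request
from collections import Counter


def summarize_pending_tasks(response):
    """Count pending cluster tasks, grouped by the first word of "source"."""
    counts = Counter()
    for task in response.get("tasks", []):
        parts = task.get("source", "").split()
        # e.g. "update-mapping [events][doc]" is grouped as "update-mapping"
        counts[parts[0] if parts else "unknown"] += 1
    return counts


def fetch_pending_tasks(base_url="http://localhost:9200"):
    """Fetch the pending-task queue; base_url is an assumed placeholder."""
    with urllib.request.urlopen(base_url + "/_cluster/pending_tasks") as resp:
        return json.load(resp)


# Made-up response for illustration only, not output from a real cluster.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "HIGH",
         "source": "update-mapping [events][doc]", "time_in_queue_millis": 86},
        {"insert_order": 102, "priority": "HIGH",
         "source": "update-mapping [events][doc]", "time_in_queue_millis": 74},
        {"insert_order": 103, "priority": "URGENT",
         "source": "create-index [other]", "time_in_queue_millis": 12},
    ]
}

summary = summarize_pending_tasks(sample)
print(summary.most_common())
```

If mapping updates dominate the summary on a live cluster, that would be
consistent with the queue described above; swap sample for
fetch_pending_tasks() to run it against a real master.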
This brings me to my question. When dynamic mapping is performed, is this
on the master only? If so, that would introduce a bottleneck and could
explain our performance drop. Otherwise, I'm at a loss to explain this issue.
Any advice would be appreciated.