So I have a two node cluster, 1 master and 1 data node on two
different servers.
I have one index that has 1 replica and 5 shards.
I'd like to have my indexing impact searching as little as possible.
What I'm noticing is that indexing hits the data node server
pretty hard in terms of CPU, which I wouldn't really expect.
What I thought would happen is that I could index all the data on the
master and then propagate it to the nodes. I understand that Elasticsearch
is more real-time and replicates everything immediately.
I therefore think I'm missing something. I'm sure that people have
run into the problem where they don't want indexing to impact search
performance. What is the best way to do that with Elasticsearch?
Is it because I need to change my transaction log and queue more
documents before things get flushed? Is there another configuration
I'm missing?
I have about 80K documents that total 5.3GB in size once indexed and
optimized in Elasticsearch.
If I isolate the indexing to one box, it takes 35 minutes to index all
the documents and completely hammers the CPU,
which is OK since it's isolated.
I really don't need the real-time addition of documents. In fact, given
the hardware constraints I have, I'd prefer to just
bulk load on an indexing node and then somehow replicate the index out to
the other nodes.
I do need to have nodes that only handle searching and are up 100% of the
time.
I think you are confusing master and data nodes. Why do you think the
master node handles the indexing? It doesn't. The elected master node
just acts as the cluster-wide coordinator and has nothing to do with the
actual data you index.
What exactly are you after? Bulk indexing data into a new index, and then
making it searchable?
I see. Well, there isn't an option to do that in 0.17, but, in master, you
have the option to control which nodes an index will be allocated on (based
on custom node attributes that you define).
So, you can create an index, and have it allocated only on "indexing" nodes
(for example, nodes with attribute of node.indexing set to true), and, once
the indexing is done, you can dynamically change the filtering allocation
for that index to be allowed to be allocated on search nodes.
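As a rough sketch of what those two steps could look like: the snippet below only builds the request bodies and prints them, it does not talk to a cluster. The setting name index.routing.allocation.include.<attribute> is the one used by later Elasticsearch releases, so check it against the version you run. The index name "docs" is made up, and the sketch assumes search nodes are started with node.indexing set to false so the filter can simply be flipped from "true" to "false".

```python
import json


def allocation_settings(attribute, value):
    # Index-level allocation filter: only nodes whose custom attribute
    # <attribute> matches <value> may hold shards of the index.
    # Assumption: setting name as used by later Elasticsearch releases.
    return {"index.routing.allocation.include." + attribute: value}


def create_index_request(index, attribute, value):
    # Request that creates the index with the filter already applied,
    # so its shards only ever land on the matching ("indexing") nodes.
    return ("PUT", "/" + index, {"settings": allocation_settings(attribute, value)})


def move_index_request(index, attribute, value):
    # Request that changes the filter on a live index; the cluster then
    # relocates the shards to nodes matching the new value.
    return ("PUT", "/" + index + "/_settings", allocation_settings(attribute, value))


if __name__ == "__main__":
    # "docs" is a hypothetical index name; "indexing" is the node attribute
    # from the example above (node.indexing in the node configuration).
    steps = [
        create_index_request("docs", "indexing", "true"),   # before bulk load
        move_index_request("docs", "indexing", "false"),    # after bulk load
    ]
    for method, path, body in steps:
        print(method, path, json.dumps(body))
```

You would send the first request before the bulk load and the second one once indexing has finished, using whatever HTTP client you already have.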
Also, you can create the index with an initial number_of_replicas set to 0,
and once indexing is done, increase it to the number of replicas you want.
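This will reduce the amount of indexing that needs to happen, since each
document is then indexed only on its primary shard during the bulk load
instead of on every replica as well.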
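The replica part could be sketched like this. Again it only builds and prints the request bodies for the create index and update settings calls. The index name "docs" is made up, and the 5 shards / 1 replica figures are taken from the setup described in the question.

```python
import json


def create_body(shards, replicas=0):
    # Body for PUT /<index>: start with no replicas so the bulk load
    # only writes to the primary shards.
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}


def replicas_body(replicas):
    # Body for PUT /<index>/_settings: number_of_replicas can be changed
    # on a live index, unlike number_of_shards.
    return {"index": {"number_of_replicas": replicas}}


if __name__ == "__main__":
    index = "docs"  # hypothetical index name
    # Step 1: create the index with 5 shards and 0 replicas, then bulk load.
    print("PUT", "/" + index, json.dumps(create_body(5)))
    # Step 2: once indexing is done, raise the replica count to 1.
    print("PUT", "/" + index + "/_settings", json.dumps(replicas_body(1)))
```

Once the replica count is raised, the new replicas are built by copying the already indexed shard data over, so the documents do not have to be indexed a second time.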