Still have Indexing Question

Hi,

I have a two-node cluster, 1 master and 1 data node, on two
different servers.

I have one index with 5 shards and 1 replica.

I'd like indexing to impact searching as little as possible.
What I'm noticing is that indexing hits the data node server
pretty hard in terms of CPU, which I didn't expect.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0  18202    100   8703    0    0     0     8 5953 2953 87  6  2  6  0
 5  0      0  18077    100   8828    0    0     0     4 6654 4574 85 13  3  0  0
 4  0      0  18235    100   8669    0    0     0     0 9425 6826 71 12  7 10  0
 4  0      0  18185    100   8718    0    0     0     4 6787 3080 90  9  1  0  0
 3  1      0  18143    100   8760    0    0     0     0 4477 5218 92  6  1  0  0
 1  0      0  18110    100   8789    0    0     0     0 5116 5751 79  6 13  1  0

What I thought would happen is that I could index all the data on the
master and then propagate it to the other nodes. I understand that
Elasticsearch is more real-time and replicates everything immediately.

I therefore think I'm missing something. I'm sure other people have
run into the problem of not wanting indexing to impact search
performance. What is the best way to do that with Elasticsearch?

Is it that I need to change my transaction log settings and queue more
documents before things get flushed? Is there another configuration
setting I'm missing?

Thanks,

Neil

To add some more data:

I have about 80K documents that total 5.3GB once indexed and
optimized in Elasticsearch.
If I isolate the indexing to one box, it takes 35 minutes to index all
the documents and completely hammers the CPU,
which is OK since it's isolated.

I really don't need real-time addition of documents. In fact, given
the hardware constraints I have, I'd prefer to just
bulk load on an index node and then somehow replicate the result out
to the other nodes.

I do need to have nodes that just handle searching and are up 100% of
the time.

Thanks,

Neil

I think you are confusing master and data nodes. Why do you think the
master node handles the indexing? It doesn't. The elected master node
just acts as the cluster-wide coordinator; it has nothing to do with the
actual data you index.

What exactly are you after? Bulk indexing data into a new index, and then
making it searchable?

(In reply to Neil's message of Thu, Sep 29, 2011 at 5:07 AM.)

What exactly are you after? Bulk indexing data into a new index, and then
making it searchable?

Yes -- exactly. I did try bulk indexing, using bulk requests of about
750 documents at a time.
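
For anyone following along, batching of that kind could be sketched roughly as below. This is only an illustration: the index name, type name, and document shape are made up, and the action line follows the newline-delimited format of the bulk API as I understand it, so check it against the version you run.

```python
import json
from itertools import islice


def batches(docs, size=750):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index, doc_type, chunk):
    """Build a newline-delimited bulk payload.

    Each document contributes an action line followed by a source line.
    `chunk` is a list of (doc_id, source_dict) pairs (hypothetical shape).
    """
    lines = []
    for doc_id, source in chunk:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The payload must end with a newline.
    return "\n".join(lines) + "\n"
```

Each string returned by `bulk_body` would be sent as the body of one POST to the `_bulk` endpoint.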

And given my hardware constraints I wanted to isolate the indexing
somehow so it doesn't affect the search performance
of the cluster.

I see. Well, there isn't an option to do that in 0.17, but master (the
development branch) has an option to control which nodes an index will be
allocated on (based on custom node attributes that you define).

So, you can create an index, and have it allocated only on "indexing" nodes
(for example, nodes with attribute of node.indexing set to true), and, once
the indexing is done, you can dynamically change the filtering allocation
for that index to be allowed to be allocated on search nodes.
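
A rough sketch of what those two settings updates might look like, written as Python that only builds the requests. The setting key `index.routing.allocation.include.<attribute>` is the name used by allocation filtering in released versions and may differ in the development branch; the index name `docs` and the idea of starting search nodes with `node.indexing: false` are assumptions made for the example.

```python
import json


def include_filter_request(index, attr, values):
    """Return (method, path, body) for an update-settings call that limits
    the index to nodes whose custom attribute `attr` matches one of `values`."""
    setting = "index.routing.allocation.include.%s" % attr
    body = json.dumps({setting: ",".join(values)})
    return ("PUT", "/%s/_settings" % index, body)


# While bulk loading: only nodes started with node.indexing: true qualify.
during_load = include_filter_request("docs", "indexing", ["true"])

# After loading: also allow nodes started with node.indexing: false
# (assumed here to be the search nodes).
after_load = include_filter_request("docs", "indexing", ["true", "false"])
```

The same setting key can presumably be supplied in the settings of the create-index call, so the index never lands on a search node in the first place.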

Also, you can create the index with an initial number_of_replicas set to 0,
and once indexing is done, increase it to the number of replicas you want.
This will reduce the amount of indexing that needs to happen, since each
document is then indexed only on its primary shard during the load.
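
As a sketch of the replica approach, the two calls could be built like this. The index name and the helper functions are hypothetical; `number_of_shards` and `number_of_replicas` are the standard index settings, and the 5 shards and 1 replica match the numbers given earlier in the thread.

```python
import json


def create_index_request(index, shards, replicas=0):
    """Return (method, path, body) to create an index with no replicas."""
    settings = {"index": {"number_of_shards": shards,
                          "number_of_replicas": replicas}}
    return ("PUT", "/%s" % index, json.dumps({"settings": settings}))


def set_replicas_request(index, replicas):
    """Return (method, path, body) to change the replica count afterwards."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return ("PUT", "/%s/_settings" % index, body)


# Before the bulk load: 5 shards, no replicas.
before_load = create_index_request("docs", shards=5)

# After the bulk load: bring the replica count back up to 1.
after_load = set_replicas_request("docs", 1)
```

number_of_replicas can be changed on a live index, while number_of_shards is fixed at creation, which is why only the replica count appears in the second call.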

(In reply to Neil's message of Mon, Oct 3, 2011 at 9:42 PM.)