I would like some clarification on this topic. Assume I am on a cloud with
two instances (instance1 & instance2).
instance1 is where I process all my files and make them JSON-ready.
instance2 is where the actual Elasticsearch server runs.
On instance1, I can run my application, which opens a transport client to
instance2's IP address, iterates through my documents, and pushes them to a
BulkProcessor.
My question: wouldn't the BulkProcessor still have to push the data
remotely to instance2, where the actual ES server is? So with each bulk of,
say, 1,000 docs, it has to push 1,000 docs over the network. Isn't this a
huge bottleneck? What's the ideal way to handle such a scenario? Am I going
about this the wrong way?
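For concreteness, the batching behavior in question — BulkProcessor buffering documents client-side and sending each batch as one request — can be sketched like this (a toy stand-in for illustration, not the actual Elasticsearch class):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of what BulkProcessor does on the client: buffer index requests
// and flush them as one batched network call once bulkActions is reached.
// This is an illustration only, not the Elasticsearch API.
class ToyBulkProcessor {
    private final int bulkActions;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;
    private int docsSent = 0;

    ToyBulkProcessor(int bulkActions) {
        this.bulkActions = bulkActions;
    }

    void add(String jsonDoc) {
        buffer.add(jsonDoc);
        if (buffer.size() >= bulkActions) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // In the real client this is one round trip carrying the whole
        // batch, not one request per document.
        docsSent += buffer.size();
        flushes++;
        buffer.clear();
    }

    int getFlushes() { return flushes; }
    int getDocsSent() { return docsSent; }
}

public class ToyBulkDemo {
    public static void main(String[] args) {
        ToyBulkProcessor bp = new ToyBulkProcessor(1000);
        for (int i = 0; i < 2500; i++) {
            bp.add("{\"id\":" + i + "}");
        }
        bp.flush(); // flush the partial last batch
        System.out.println(bp.getFlushes() + " flushes, "
                + bp.getDocsSent() + " docs");
    }
}
```

The point of the sketch: 1,000 docs per bulk means one network request per 1,000 docs, not 1,000 separate requests.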
Correct. You will be network bound when pushing content remotely. However,
how is this different from any storage engine? If instance2 were running
MySQL, you would still be pushing data over the network.
Perhaps the most efficient solution is to run your process as a river, but
you lose some flexibility in terms of process control. Overall, I think
remote indexing is a good solution.
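To put rough numbers on the network cost (the per-document size and link speed below are assumptions, not measurements), a quick estimate suggests the wire transfer for a bulk is usually small compared to the indexing work itself:

```java
// Back-of-envelope estimate of the time a single bulk spends on the wire.
// Both kbPerDoc and linkGbps are assumed figures; substitute your own.
public class BulkTransferEstimate {
    public static void main(String[] args) {
        int docsPerBulk = 1000;
        double kbPerDoc = 2.0;   // assumed average JSON document size
        double linkGbps = 1.0;   // assumed instance1 -> instance2 bandwidth

        double bulkMb = docsPerBulk * kbPerDoc / 1024.0;    // ~1.95 MB per bulk
        double linkMbPerSec = linkGbps * 1000.0 / 8.0;      // 125 MB/s
        double transferMs = bulkMb / linkMbPerSec * 1000.0; // ~15.6 ms

        System.out.printf("%.2f MB per bulk, ~%.1f ms on the wire%n",
                bulkMb, transferMs);
    }
}
```

Under these assumed numbers, a 1,000-doc bulk costs on the order of tens of milliseconds of transfer on a 1 Gbps link, so indexing on the server side is more likely to be the limiting factor than the network.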
Thanks Ivan. Yes, it's no different; I just wanted to make sure I am going
about it the right way. The only thing that concerns me is the bulks not
keeping up due to networking: pushing and indexing are heavy on I/O,
whereas the storage scenario you mentioned may not be as time-sensitive an
operation (I think).
I thought about having the index server on the same machine and just making
that machine a powerful one, but I am not sure what the drawbacks to this
are. I can think of one: indexing remotely allows me to set up a cluster of
machines to index to.
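That last upside can be made concrete: with remote indexing, the client can spread bulks across several data nodes instead of being tied to one box. The node addresses below are made up, and this is a hand-rolled sketch — the real Java TransportClient balances across the transport addresses you register, so you would not normally write this yourself:

```java
import java.util.Arrays;
import java.util.List;

// Toy round-robin dispatcher: each bulk goes to the next node in the list.
// Illustrates why remote indexing scales out to a cluster of data nodes.
class RoundRobinDispatcher {
    private final List<String> nodes;
    private int next = 0;

    RoundRobinDispatcher(List<String> nodes) {
        this.nodes = nodes;
    }

    String nodeForNextBulk() {
        String node = nodes.get(next);
        next = (next + 1) % nodes.size();
        return node;
    }
}

public class RoundRobinDemo {
    public static void main(String[] args) {
        // Placeholder addresses for three hypothetical data nodes.
        RoundRobinDispatcher d = new RoundRobinDispatcher(
                Arrays.asList("10.0.0.2:9300", "10.0.0.3:9300", "10.0.0.4:9300"));
        for (int bulk = 1; bulk <= 4; bulk++) {
            System.out.println("bulk " + bulk + " -> " + d.nodeForNextBulk());
        }
    }
}
```

Co-locating the indexer with a single powerful ES node gives up exactly this: the ability to add nodes and fan bulks out across them.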
On Thursday, April 24, 2014 3:46:59 PM UTC-4, Ivan Brusic wrote:
Correct. You will be network bound when pushing content remotely. However,
how is this different from any storage engine? If instance2 were running
MySQL, you would still be pushing data over the network.
Perhaps the most efficient solution is to run your process as a river, but
you lose some flexibility in terms of process control. Overall, I think
remote indexing is a good solution.
Cheers,
Ivan
On Thu, Apr 24, 2014 at 11:55 AM, IronMan2014 <sabda...@gmail.com> wrote: