Index content on one machine, send index to other machine?

Hi, I am pretty new to elasticsearch, so I'm sorry if this is a trivial question. I am trying to figure out if it is possible to have a setup like this:

Machine A has a large corpus of documents. Machine B is running elasticsearch.

Machine A scans its corpus of documents and runs the elasticsearch full-text indexing procedure on that content. Machine A sends the compact indexed version of the documents to Machine B. Machine B inserts the index provided by Machine A into its elasticsearch instance. Machine B is now able to search for documents on Machine A and return IDs representing files from Machine A. Machine B has never seen the full content of Machine A's files.

By doing it this way I do not need to move the data around as much and I would probably save a lot of bandwidth and time.

Is this at all possible, or does Elasticsearch require the entire content to be on the same cluster/machine that is running the elasticsearch instance? I have not been able to find any information about this anywhere, possibly due to my lack of understanding.

If anyone could enlighten me on this subject that would be great! :smile:


This is not how ES works; you need to send the complete data to it to be able to search.

Thanks for the quick reply! Is it possible to elaborate a bit?

Is this not possible because there are no APIs for it? Or is there something inherent in the way the indexing is performed that requires all the content to be at the same location as where the search will be performed?

I have found that you can set "store" to "no" and "index" to "yes", so the content actually does not stick around as far as I understand (related stackoverflow: http://stackoverflow.com/questions/17103047/why-do-i-need-storeyes-in-elasticsearch). This leads me to think that the approach I am describing possibly could be done.

But again, I am an elasticsearch noob. It is quite likely that this isn't possible to do, but I would really like to understand why. If it is as simple as "there is currently no API for it" I could maybe look into making something myself.

You still need to ship all the data over to ES for it to be indexed, irrespective of it being stored or not.

The only way to index the data on machine A would be to install ES on it. If you then wanted to move that data to machine B you can. That may save you bandwidth, but it seems like a lot of work for little return.
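To make the point above a bit more concrete, here is a toy inverted index in plain Python. It is only a sketch of the general idea and not how ES (or Lucene underneath it) actually does things; the corpus, file names, and function names are all made up for illustration. It shows that whichever machine builds the index has to read the full text, while the machine that searches only needs the finished index.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a toy inverted index mapping each term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # The full text is needed here, on whichever machine runs this step.
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical corpus that lives only on machine A.
corpus = {
    "a.txt": "Elasticsearch builds an inverted index",
    "b.txt": "The index maps terms to documents",
}

# "Machine A" builds the index and serializes it for shipping.
shipped = json.dumps(build_index(corpus))

# "Machine B" loads the index; it never sees the original text.
index_on_b = json.loads(shipped)
print(search(index_on_b, "inverted index"))  # ['a.txt']
print(search(index_on_b, "index"))  # ['a.txt', 'b.txt']
```

In real ES the indexing step happens inside the node, which is why the documents have to be sent to a machine that runs ES, whether or not the source is stored afterwards.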

So, it would be possible to install ES on machine A, have that index a document, and then transfer the index of just that document to machine B? Or are you saying that this approach would have to transfer the whole ES database from machine A to machine B?

ES isn't a database. However, conceptually you are right.

If you install ES on A and then index that data, you need to copy that data to B if that is where you want to query it.
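One way to do that copy, assuming you go with ES's snapshot and restore API (the reply above does not say which mechanism was meant), is sketched below. The host names, repository name, snapshot name, index name, and path are all hypothetical, and the code only builds the list of HTTP calls without sending anything. It also assumes a shared-filesystem (`fs`) repository whose location is allowed via `path.repo` on both nodes and is readable from B, for example after copying the repository directory over.

```python
import json


def snapshot_restore_plan(host_a, host_b, repo, snapshot, index_name, location):
    """Return the (method, url, json_body) calls for moving one index from A to B."""
    repo_body = json.dumps({"type": "fs", "settings": {"location": location}})
    index_body = json.dumps({"indices": index_name})
    return [
        # 1. Register a shared-filesystem snapshot repository on A.
        ("PUT", f"{host_a}/_snapshot/{repo}", repo_body),
        # 2. Snapshot only the chosen index on A and wait until it finishes.
        ("PUT", f"{host_a}/_snapshot/{repo}/{snapshot}?wait_for_completion=true", index_body),
        # 3. Register the same repository on B (the files must be reachable from B).
        ("PUT", f"{host_b}/_snapshot/{repo}", repo_body),
        # 4. Restore the index from that snapshot on B.
        ("POST", f"{host_b}/_snapshot/{repo}/{snapshot}/_restore", index_body),
    ]


# Hypothetical names, purely for illustration.
plan = snapshot_restore_plan(
    host_a="http://machine-a:9200",
    host_b="http://machine-b:9200",
    repo="my_repo",
    snapshot="snap_1",
    index_name="docs",
    location="/mnt/es_backups",
)
for method, url, body in plan:
    print(method, url, body)
```

Each tuple can be sent with whatever HTTP client you like. Note that this moves a whole index, not the index entries for a single document, and the ES versions on A and B need to be compatible for the restore to work.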