Parallel document processing across nodes?


I've been experimenting with writing an Elasticsearch plugin and have added a few endpoints that do things Elasticsearch doesn't currently do, such as text summarization. Is there an interface or design pattern for processing multiple documents in parallel across Elasticsearch nodes? I'm comfortable fetching a single document and operating on its text field(s), but how would I go about implementing an algorithm that processes documents on multiple nodes in parallel (if that is possible at all)?

Right now I can just call the multiple get request internally, but I assume that fetches documents from different nodes and then does all the processing on a single node, which doesn't sound very efficient. (Please correct me if I'm wrong; I'd love a better understanding of what's happening internally with the internal API.)


Search in Elasticsearch is distributed by nature. The query phase of a search runs in parallel, one thread per shard, up to the number of threads available in the search thread pool. The fetch phase also runs in parallel on each shard. I am not sure how your plugin operates, so it is difficult for me to suggest anything about its design. I can only mention that if your plugin is implemented as a script field, it would be part of the fetch phase and would therefore be executed in parallel.
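
To make the per-shard idea concrete, here is a toy model in plain Java. It does not use any Elasticsearch API: the class name, `processShard`, `wordCount`, and the in-memory lists standing in for shards are all hypothetical, and the thread pool is an ordinary `ExecutorService`, not the real search thread pool. It only sketches the pattern described above: each shard processes its own documents in parallel, and the coordinating side combines the small partial results instead of pulling every document to one node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy scatter-gather model. Nothing here is an Elasticsearch class.
class ShardScatterGather {

    // Hypothetical per-document work: count whitespace-separated tokens.
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Work done locally on one "shard": process every document it holds
    // and return a single small partial result.
    static int processShard(List<String> docs) {
        int sum = 0;
        for (String doc : docs) {
            sum += wordCount(doc);
        }
        return sum;
    }

    // Scatter: submit one task per shard. Gather: sum the partial results.
    static int scatterGather(List<List<String>> shards, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> partials = new ArrayList<>();
            for (List<String> shard : shards) {
                partials.add(pool.submit(() -> processShard(shard)));
            }
            int total = 0;
            for (Future<Integer> partial : partials) {
                total += partial.get(); // blocks until that shard's task is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> shards = Arrays.asList(
            Arrays.asList("first document text", "second one"),
            Arrays.asList("a document on another shard"));
        System.out.println("total tokens: " + scatterGather(shards, 2));
    }
}
```

The contrast with the multi-get approach is that only one integer per shard travels back to the caller here, whereas multi-get ships the full documents to a single node before any processing happens.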

Right now I can just call the multiple get request internally,

In that case, I am not sure why you implemented it as a plugin. Wouldn't it be easier to implement it as an application outside of Elasticsearch?
