Long-running BulkProcessor


(Ron) #1

We regularly update our index with individual document index requests, and to improve performance I want to move to a bulk/batch processing model. This is a Java app.

For the Java BulkProcessor, is it acceptable to allocate a single bulk processor at startup (and close it at shutdown) that runs constantly, accepting single index updates that the BulkProcessor then processes asynchronously? I'd envision setting it up so that it flushes when either 25 documents are queued or, say, 5 minutes have elapsed.
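To make the "flush on 25 documents or 5 minutes, whichever comes first" idea concrete, here is a minimal stdlib-only sketch. `BatchingBuffer` and its `sink` callback are hypothetical stand-ins for illustration, not the Elasticsearch `BulkProcessor` API; a real setup would configure the equivalent thresholds on the BulkProcessor builder instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for a long-lived bulk processor (NOT the Elasticsearch class).
// It hands a batch to `sink` when maxActions documents are queued or when the
// flush interval elapses, whichever happens first.
final class BatchingBuffer<T> implements AutoCloseable {
    private final int maxActions;
    private final Consumer<List<T>> sink;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "flush-timer");
                t.setDaemon(true); // do not keep the JVM alive just for the timer
                return t;
            });
    private List<T> pending = new ArrayList<>();

    BatchingBuffer(int maxActions, long flushInterval, TimeUnit unit, Consumer<List<T>> sink) {
        this.maxActions = maxActions;
        this.sink = sink;
        // Time-based flush: runs every flushInterval, a no-op if nothing is queued.
        timer.scheduleWithFixedDelay(this::flush, flushInterval, flushInterval, unit);
    }

    // Queue one document; a count-based flush triggers once the threshold is reached.
    synchronized void add(T doc) {
        pending.add(doc);
        if (pending.size() >= maxActions) {
            flush();
        }
    }

    // Hand the queued documents to the sink as one batch.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }

    // Called at application shutdown: stop the timer and flush the remainder.
    @Override
    public void close() {
        timer.shutdown();
        flush();
    }
}
```

With these assumed settings, adding 60 documents produces two batches of 25 right away, and the remaining 10 go out on the next timer tick or on `close()`.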

Is this an appropriate use of the BulkProcessor?


(Jörg Prante) #2

25 documents is a very low number, because with BulkProcessor you are expected to index thousands of documents per second. Also, 5 minutes is a long duration; the default flush interval is 5 seconds, for a reason.

You should close the BulkProcessor after an intensive run of thousands or millions of documents, in order to flush the last documents properly and wait for them to be indexed. This can be important as a kind of checkpoint, so you can be sure that queries search over the whole set of documents. For subsequent actions, you can simply instantiate a new BulkProcessor.
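The checkpoint idea can be sketched with plain `java.util.concurrent`. `BulkRun` below is a hypothetical stand-in, not the Elasticsearch class, and the counter only simulates documents being indexed; the real BulkProcessor offers `close()` and `awaitClose(timeout, unit)` for the same purpose.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one intensive bulk run (NOT the Elasticsearch class).
final class BulkRun {
    private final ExecutorService sender = Executors.newFixedThreadPool(2);
    private final AtomicInteger indexed = new AtomicInteger();

    // Send one batch asynchronously; the counter simulates the server indexing it.
    void submit(int batchSize) {
        sender.execute(() -> indexed.addAndGet(batchSize));
    }

    // Checkpoint: accept no new batches and wait for in-flight ones to finish.
    // Returns true only if everything completed within the timeout.
    boolean awaitClose(long timeout, TimeUnit unit) {
        sender.shutdown();
        try {
            return sender.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    int indexedCount() {
        return indexed.get();
    }
}
```

Only after `awaitClose` returns true is it safe to treat the run as complete; a later run would then start with a fresh instance.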

I'm not sure about your workload, but if you index only a few documents over a span of minutes, using bulk indexing is rather questionable. It is easier to send them as individual IndexRequests.


(Ron) #3

It's easier for sure. I just didn't know whether there was something to be gained by having a BulkProcessor running in the background, almost as a service of some sort.

I'll just plan on sending individual IndexRequests and avoid over-engineering it :laughing:

