There is no way to call setPipeline on the UpdateRequest API, so I tried the approach below, and sadly it doesn't work.
The existing document is updated, but without the pipeline transformations applied.
Can the experts please suggest a solution or alternatives?
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;

// The IndexRequest carries the ingest pipeline.
IndexRequest request = new IndexRequest(indexConfig.getIndexName(), indexConfig.getIndexType(), docId)
        .source(source);
request.setPipeline(indexConfig.getPipeline());
if (appConfig.isUpdateRequest()) {
    // Update path: the IndexRequest only supplies the upsert document;
    // its pipeline setting is ignored, so updates skip the pipeline.
    UpdateRequest upsertRequest = new UpdateRequest(indexConfig.getIndexName(), indexConfig.getIndexType(), docId)
            .doc(source)
            .upsert(request);
    bulkProcessor.add(upsertRequest);
} else {
    bulkProcessor.add(request);
}
We perform bulk indexing all the time. First we do an initial bulk indexing with pipelines. After that we do delta indexing, again in bulk mode, and here we need the same pipelines applied so that the resulting documents match the ones produced during initial indexing.
No support for pipelines on bulk update means I have to either call update by query with the pipeline after the index update, or remove the pipeline altogether and move the pipeline logic into application code, which is bad.
Initial indexing (bulk / insert / IndexRequest) - We pull all entities to be indexed from an application REST endpoint.
Delta indexing (bulk / update / UpdateRequest with docAsUpsert) - Here we pull all entities created or modified as of a given point in time. The response may contain nothing, or as many as a million entities. In this scenario, we have to update documents if they already exist or create new ones if they don't (a minimal sketch of this request follows below).
So we have to apply the same pipelines on both routes. That way the field values stay intact.
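For reference, a minimal sketch of the delta-indexing request using docAsUpsert, assuming the same indexConfig, docId, and source variables as in the snippet above:

import org.elasticsearch.action.update.UpdateRequest;

// docAsUpsert: the same document is used both as the partial update
// and, if the document is missing, as the full upsert document.
UpdateRequest delta = new UpdateRequest(indexConfig.getIndexName(), indexConfig.getIndexType(), docId)
        .doc(source)
        .docAsUpsert(true);
bulkProcessor.add(delta);
// Note: there is still no setPipeline here -- that is exactly the gap.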
What's happening now is:
Initial indexing - the field named 'status' is converted to lowercase via a lowercase processor. The values are active & inactive.
Delta indexing - as the pipeline is not applied, the value in the status field ends up as ACTIVE / INACTIVE / Active / Inactive etc.
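For context, a minimal sketch of such a pipeline, created here with the low-level REST client; the pipeline id status-lowercase and the RestClient instance named restClient are assumptions for illustration:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

// Define a pipeline with a single lowercase processor on the status field.
Request putPipeline = new Request("PUT", "/_ingest/pipeline/status-lowercase");
putPipeline.setJsonEntity(
        "{ \"description\": \"lowercase the status field\"," +
        "  \"processors\": [ { \"lowercase\": { \"field\": \"status\" } } ] }");
restClient.performRequest(putPipeline);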
Yes, we can work around this, but I strongly feel you should consider supporting pipelines on bulk UpdateRequest.
Discussed during fix-it Friday, and this looks like a useful enhancement, but there are corner cases which would make it very tricky to support (for example, when the index name or routing is changed during ingestion, or when a node isn't allowed to run ingest). Therefore I'm closing this issue; we can re-evaluate it at a later time if this is still useful and the technical concerns can be fixed easily.
A workaround could be to use update by query instead, as that supports ingest pipelines, at the price of slowness...
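If update by query is acceptable, a minimal sketch with the high-level REST client might look like the following; the client variable and the match-all query are assumptions, and you should verify that setPipeline is available in your client version:

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.UpdateByQueryRequest;

// Re-run the ingest pipeline over documents that were updated without it.
UpdateByQueryRequest ubq = new UpdateByQueryRequest(indexConfig.getIndexName());
ubq.setQuery(QueryBuilders.matchAllQuery()); // or narrow this to the delta set
ubq.setPipeline(indexConfig.getPipeline());  // documents are re-indexed through the pipeline
BulkByScrollResponse response = client.updateByQuery(ubq, RequestOptions.DEFAULT);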
Another workaround would be to simulate the pipeline yourself by calling the _simulate ingest endpoint and then sending the result using the update API.
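A rough sketch of that approach with the low-level REST client; sourceJson (the document source as a JSON string) and restClient are assumptions, and parsing of the simulate response is elided:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;

// 1) Run the document through the pipeline without indexing it.
Request simulate = new Request("POST",
        "/_ingest/pipeline/" + indexConfig.getPipeline() + "/_simulate");
simulate.setJsonEntity("{ \"docs\": [ { \"_source\": " + sourceJson + " } ] }");
Response simResponse = restClient.performRequest(simulate);
// 2) Parse docs[0].doc._source from simResponse and use that transformed
//    JSON as the doc/upsert body of the UpdateRequest sent to bulkProcessor.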