I use Workplace Search to benefit from its connectors, in particular the SharePoint connector. I need to do some post-processing on my documents (cleaning, alignment, enrichment via an external API), so I use Logstash with Workplace Search's indices as input and write the updates back to the same indices as output. However, I noticed that with each daily full synchronization I lose all my enrichments: during the sync, Workplace Search does not seem to update the documents but creates them again. Is there a way to keep my changes (other than using a new index as output)?
I applaud your ingenuity in using Logstash to apply enrichment and post-processing to the documents indexed by Workplace Search. However, as you've observed, full syncs don't just apply "changes"; they fully refresh the content source from scratch, overwriting existing data.
You have a few potential options.
You could sync once, then disable content source syncing with the setting workplace_search.content_source.sync.enabled: false. However, this would disable all syncs for all of your content sources, so you wouldn't get any updates. It seems unlikely you'd want that.
You could instead use Logstash to feed the enriched documents into a Custom API Source. You could then make the original SharePoint source non-searchable, so that you don't get duplicate results in your UI.
You could abandon the out-of-the-box SharePoint connector entirely and use only the Custom API Source approach, with custom-written extraction and post-processing code. While this option is a lot more work, it will probably be more stable over time, since we don't guarantee the stability of Workplace Search's underlying indices: they are subject to changes that may not be backwards compatible with your Logstash pipeline.
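The second and third options both come down to pushing already-enriched documents into a Custom API Source. As a rough sketch of what that push could look like from a small script (not an official client): the endpoint path and the 100-documents-per-request limit reflect my understanding of the Workplace Search Custom Source API and should be checked against the documentation for your version, while the host, source ID, access token, and document fields are placeholders.

```python
import json
import urllib.request

# Assumed per-request document limit; verify against your version's docs.
MAX_BATCH = 100


def chunk(docs, size=MAX_BATCH):
    """Split a list of documents into batches of at most `size`."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def build_bulk_request(base_url, source_id, token, docs):
    """Build (but do not send) a bulk_create request for a Custom API Source.

    The URL layout is an assumption based on the Custom Source API;
    source_id and token are placeholders for your own values.
    """
    url = f"{base_url.rstrip('/')}/api/ws/v1/sources/{source_id}/documents/bulk_create"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def push_documents(base_url, source_id, token, docs):
    """Send enriched documents in batches and return the parsed responses."""
    responses = []
    for batch in chunk(docs):
        req = build_bulk_request(base_url, source_id, token, batch)
        with urllib.request.urlopen(req) as resp:
            responses.append(json.loads(resp.read()))
    return responses
```

Each document should carry a stable id (for example, the SharePoint document's own identifier) so that re-sending it updates the existing entry in the custom source rather than creating a duplicate.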
We've talked about adding custom post-processing capabilities, but have not baked that into the solution yet. If you have a support relationship with Elastic, you can file an Enhancement Request for it, to help bump up the priority.
Thank you very much for your answer.
For now, as we are in the POC phase, I will use a new Elasticsearch index for my output, so my enrichments will persist.