We are using Elasticsearch to provide full-text search across our various web sites. Our search platform works like Bing or Google: for a search phrase entered in the search box, it returns all relevant documents from our own web sites, each with a Title, Description and Url.
We also built a BI platform that collects all user query/click events on the search result page, and we use the query/click info of each page as part of our relevance tuning approach.
The query click info of each page is an array of query terms and the corresponding click counts on that page.
An example document with query click info looks something like this:
{
"Title": "Elasticsearch 5.2.0 is released!",
"Description": "Say heiya to 5.2.0, we are pleased to announce that Elasticsearch 5.2.0 is released!",
"Url": "http://www.elastic.com/release/5.2.0",
"QueryClicks": [
{ "Term": "elasticsearch", "Count": 200 },
{ "Term": "elastic", "Count": 100 }
]
}
We have an ingestion pipeline that incrementally crawls the updated/added/deleted pages and ingests the basic doc info (Title, Description and Url) into the ES index.
Incremental crawl means we only ingest a document when it is added/updated/deleted in our CMS, instead of using a scheduled job to re-crawl all pages and rebuild the whole index.
Meanwhile, to keep the index consistent when docs are updated/deleted at the same time in the CMS, we use the last modified time of the doc as the external version when indexing it, so that only the newer version of the document ends up in the index.
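To illustrate the external-version scheme described above, here is a minimal Python sketch using only the standard library. It assumes the last modified time is converted to epoch milliseconds; the index name `pages`, the type `page` and all helper names are hypothetical, and the URL follows the 5.x-style `/{index}/{type}/{id}` layout. The `accepts_write` check only mimics the rule applied for `version_type=external` (the incoming version must be strictly greater than the stored one); it is not a real client call.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urlencode

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def external_version(last_modified):
    """Convert a timezone-aware last-modified time to epoch milliseconds.

    Assumption: the CMS timestamp is timezone-aware; a naive datetime
    would raise TypeError on the subtraction below.
    """
    return (last_modified - EPOCH) // timedelta(milliseconds=1)


def index_url(index, doc_type, doc_id, version):
    """Build the path and query string for an Index API call with an external version."""
    query = urlencode({"version": version, "version_type": "external"})
    return f"/{index}/{doc_type}/{quote(doc_id, safe='')}?{query}"


def accepts_write(stored_version, incoming_version):
    """Mimic the external-version rule: accept only a strictly greater version."""
    return stored_version is None or incoming_version > stored_version
```

With this scheme, a stale crawl result carrying an older last modified time is rejected with a version conflict instead of overwriting the newer document.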
The query click info is collected and processed in a separate pipeline, which runs monthly, queries the latest month of BI data and updates the whole index with the newest query click info.
Now comes the problem:
Since both the crawling pipeline and the query click pipeline need to partially update the index, we should use the Update API instead of the Index API in both pipelines.
But, as I mentioned, the crawling pipeline uses external versions to make sure newer docs are indexed into ES, and the Update API does NOT support external versioning.
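For reference, a partial update through the Update API sends only the changed fields, wrapped in a `doc` object. The sketch below just builds the request path and body with the Python standard library and sends nothing; the index name `pages`, the type `page`, the document id and the helper name are hypothetical, the id is assumed to be URL-safe, and the path follows the 5.x-style `/{index}/{type}/{id}/_update` layout.

```python
import json


def partial_update_request(index, doc_type, doc_id, fields):
    """Return (path, body) for a partial update of the given top-level fields.

    Assumption: doc_id needs no URL escaping.
    """
    path = f"/{index}/{doc_type}/{doc_id}/_update"
    # Only the fields under "doc" are merged into the stored document.
    body = json.dumps({"doc": fields}, sort_keys=True)
    return path, body


path, body = partial_update_request(
    "pages",
    "page",
    "42",
    {"QueryClicks": [{"Term": "elasticsearch", "Count": 200}]},
)
```

Note that nothing in this request carries the last modified time, which is exactly the gap described above.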
One workaround I can think of is to explicitly get all the docs to be indexed before indexing, update them one by one with the new data (for example, query click info), and then index them back into ES. This works, but performance is not good.
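The read-merge-reindex workaround can be sketched roughly as follows. This is an in-memory simulation, not real client code: `FakeIndex` and `merge_and_reindex` are hypothetical names, the merge is a shallow one that replaces top-level fields, and the store only mimics the external-version rule (a write is accepted only if its version is strictly greater than the stored one).

```python
class FakeIndex:
    """In-memory stand-in for an index that uses external versioning."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        """Return (version, source) or None if the document is missing."""
        return self._docs.get(doc_id)

    def index(self, doc_id, source, version):
        """Store the full document; return False on a version conflict."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False
        self._docs[doc_id] = (version, dict(source))
        return True


def merge_and_reindex(store, doc_id, partial, version):
    """Fetch the current doc, overlay the partial fields, index the result back."""
    current = store.get(doc_id)
    merged = dict(current[1]) if current else {}
    merged.update(partial)  # shallow merge: top-level fields are replaced
    return store.index(doc_id, merged, version)
```

Each document costs one read plus one full write here, which is where the performance concern comes from. The caller also has to supply a version greater than the stored one for the write to be accepted.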
I am fine with this workaround in the query click pipeline, but I don't like it in the crawling pipeline, as we want crawling to be as fast as possible.
Is there any other workaround that lets us partially update a document while still making sure the newer doc wins if the same doc is updated at the same time (as we currently do by using the last modified time as the external version)?
Hope I explained my problem clearly.