Sorry to ask the same question again; the thread where I originally asked it was locked.
We are using Elasticsearch 5.2.0 to provide full-text search on our various web sites. Our search platform works much like Bing or Google: for a given phrase typed into the search box, it returns all relevant documents (Title, Description, Url) from our own web sites.
We have an ingestion service that incrementally crawls the updated/added/deleted pages from our CMS system and ingests the basic doc information (Title, Description and Url) into the ES index.
Incremental crawl means we only ingest a document when it is added/updated/deleted in our CMS system, instead of running a scheduled job that re-crawls all pages and rebuilds the whole index.
Docs updated in the CMS are processed asynchronously: they are sent to a distributed messaging system (like Kafka), received by the ingestion service, and then added/updated/deleted to/from the Elasticsearch index. Since the same doc can be updated/deleted more than once in the CMS and the events are processed asynchronously, it is possible for an older version of a document to reach the ingestion service after a newer version.
To make sure that only the newer version of a document can be inserted into the index, we use the last modified time of the doc as the external version when indexing/deleting the document.
Note: the last modified time format is yyyyMMddHHmmssfff, e.g. 20170202091010123 (meaning 2017-02-02 09:10:10.123).
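For reference, the ingestion service's requests look roughly like this (the pages/page index and type names are just illustrative, not our real ones):

PUT pages/page/1?version=20170202091010123&version_type=external
{
  "Title": "Elasticsearch 5.2.0 is released!",
  "Description": "Say heiya to 5.2.0, we are pleased to announce that Elasticsearch 5.2.0 is released!",
  "Url": "http://www.elastic.com/release/5.2.0"
}

DELETE pages/page/1?version=20170202091020456&version_type=external

Elasticsearch applies each operation only if the supplied version is higher than the stored one; otherwise it responds with a version conflict, so the ingestion service can simply discard stale events.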
So far so good.
To fine-tune relevance based on end users' behavior, we built a BI platform that collects all user query/click events on the search result page; we use that query/click data for relevance tuning.
The query clicks of a page are just an array of the query terms through which users reached the page, together with the corresponding click counts on the page.
A document with query click info then looks something like this:
{
  "Title": "Elasticsearch 5.2.0 is released!",
  "Description": "Say heiya to 5.2.0, we are pleased to announce that Elasticsearch 5.2.0 is released!",
  "Url": "http://www.elastic.com/release/5.2.0",
  "QueryClicks": [
    { "Term": "elasticsearch", "Count": 200 },
    { "Term": "elastic", "Count": 100 }
  ]
}
The query clicks info is collected and processed by a separate service, the queryclicks service, which every month retrieves the latest month's query click events from our BI system and updates the whole index with the newest query clicks info.
Now comes the problem:
Since both the ingestion and queryclicks services need to partially update documents (so that neither service overwrites the fields owned by the other), both should use the Update API instead of the Index API.
And as I mentioned, the ingestion service currently relies on the external version to make sure only newer docs can be indexed into ES.
But the problem is that the Update API does NOT support external versioning.
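For example, a partial update like this (again with illustrative index/type names) is rejected by request validation, because the Update API only supports internal versioning:

POST pages/page/1/_update?version=20170202091010123&version_type=external
{
  "doc": {
    "QueryClicks": [
      { "Term": "elasticsearch", "Count": 200 },
      { "Term": "elastic", "Count": 100 }
    ]
  }
}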
One workaround I can think of is to keep using the Index API: before indexing, first fetch all affected docs from ES by ID, merge the new data (e.g. query click info) into them one by one, and then index the merged docs back to ES. This works, but performance is not good.
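A rough sketch of that read-merge-write round trip (illustrative names again):

GET pages/page/_mget
{
  "ids": [ "1", "2" ]
}

(merge the new QueryClicks into each returned _source in our service, then:)

POST _bulk
{ "index": { "_index": "pages", "_type": "page", "_id": "1", "_version": 20170202091010123, "_version_type": "external" } }
{ "Title": "Elasticsearch 5.2.0 is released!", "Description": "Say heiya to 5.2.0, we are pleased to announce that Elasticsearch 5.2.0 is released!", "Url": "http://www.elastic.com/release/5.2.0", "QueryClicks": [ { "Term": "elasticsearch", "Count": 200 } ] }

Every batch costs two full round trips plus the merge work in our own service, which is where the performance goes.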
Someone suggested another approach: save the external version in a document field and use a scripted update that checks the version before applying the change.
Example:
POST test/type1/1/_update
{
  "script" : {
    "inline": "if (params.lastModified > ctx._source.lastModified) { ctx._source.lastModified = params.lastModified; ctx._source.Title = params.Title; ctx._source.Description = params.Description; ctx._source.Url = params.Url } else { ctx.op = \"none\" }",
    "lang": "painless",
    "params" : {
      "lastModified" : 20170202091010123,
      "Title": "new title value...",
      "Description": "new description value...",
      "Url": "http://example.com"
    }
  }
}
But the problem is that the DELETE doc scenario cannot be handled well with the scripted update approach.
Consider this scenario:
- CMS updated a doc at time t1 and sent the event to Event Hub
- CMS deleted the same doc at time t2 (t2 > t1) and sent the event to Event Hub
- The delete event (from step 2) was received by the ingestion service before the update event (from step 1)
- The ingestion service sent a DELETE request to Elasticsearch
- The doc was deleted from Elasticsearch
- The ingestion service then applied the update event (from step 1) via scripted update
With the scripted update approach, step 6 will successfully insert the doc as a brand new document, which is not what we want (see the sketch below).
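To make that concrete, step 6 looks roughly like this, assuming the service sends an upsert section so that genuinely new pages can still be created (without one, updating a missing doc would simply fail with a document-missing error); names are illustrative:

POST pages/page/1/_update
{
  "script" : {
    "inline": "if (params.lastModified > ctx._source.lastModified) { ctx._source.lastModified = params.lastModified; ctx._source.Title = params.Title; ctx._source.Description = params.Description; ctx._source.Url = params.Url } else { ctx.op = \"none\" }",
    "lang": "painless",
    "params" : {
      "lastModified" : 20170202091010123,
      "Title": "title from the t1 update...",
      "Description": "description from the t1 update...",
      "Url": "http://example.com"
    }
  },
  "upsert" : {
    "lastModified" : 20170202091010123,
    "Title": "title from the t1 update...",
    "Description": "description from the t1 update...",
    "Url": "http://example.com"
  }
}

Because the doc was deleted in step 5, the script never runs: the upsert body is indexed as a brand new document, and the version check is bypassed entirely.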
Whereas with the external version approach (with the Index API used in step 6, of course), step 6 fails with a version conflict exception, which is exactly what we expect.
Is there a better solution that lets us partially update documents while still making sure that only newer versions win when the same doc is written concurrently (which we currently guarantee by using the last modified time as the external version)?