Partially updating documents while keeping the functionality that external versioning provides


(Xudong You) #1

We are using Elasticsearch to provide full-text search on our various web sites. Our search platform works like Bing or Google: for a search phrase entered in the search box, it returns all relevant documents from our own web sites, each with a Title, Description, and Url.

We also built a BI platform that collects all user query/click events on the search result page, and we use the query/click info of each page as part of our relevance tuning approach.

The query click info of each page is an array of query terms and the corresponding click counts on that page.

An example document with query click info is something like this:

{
    "Title": "Elasticsearch 5.2.0 is released!",
    "Description": "Say heiya to 5.2.0, we are pleased to announce that Elasticsearch 5.2.0 is released!",
    "Url": "http://www.elastic.com/release/5.2.0",
    "QueryClicks": [
        { "Term": "elasticsearch", "Count": 200 },
        { "Term": "elastic", "Count": 100 }
    ]
}

We have an ingestion pipeline that incrementally crawls the updated/added/deleted pages and ingests the basic doc info (Title, Description, and Url) into the ES index.
Incremental crawl means we only ingest a document when it is added/updated/deleted in our CMS, instead of using a scheduled job to re-crawl all pages and rebuild the whole index.

Meanwhile, to make sure the index ends up with consistent data when docs are updated/deleted at the same time in the CMS, we use the last modified time of the doc as the external version when indexing it, so that only the newer version of the document is written to the index.
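
As a rough illustration of passing the last modified time as an external version, here is a minimal sketch that builds the request path for the Index API. The `version` and `version_type=external` query parameters are the real ones; the index name `pages`, the type name `page`, and the helper name are hypothetical, and the `/{index}/{type}/{id}` layout assumes a 5.x-era cluster.

```python
from urllib.parse import quote, urlencode


def external_index_path(index: str, doc_type: str, doc_id: str, version: int) -> str:
    """Build the path for an Index API call that carries an external version.

    Elasticsearch accepts the write only if `version` is strictly greater
    than the version currently stored for the document.
    """
    query = urlencode({"version": version, "version_type": "external"})
    parts = "/".join(quote(p, safe="") for p in (index, doc_type, doc_id))
    return f"/{parts}?{query}"


# Hypothetical index/type/id; the version is a last modified time as a number.
print(external_index_path("pages", "page", "42", 20170202091010123))
```

The request body sent to that path would be the document itself (Title, Description, Url).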

The query clicks info is collected and processed in a separate pipeline, which runs monthly, queries the latest month of BI data, and updates the whole index with the newest query clicks info.

Now comes the problem:

Since both the crawling pipeline and the query clicks pipeline need to partially update the index, we should use the Update API instead of the Index API in both pipelines.

But, as I mentioned, the crawling pipeline uses an external version to make sure newer docs are indexed into ES, and the Update API does NOT support external versions.

One workaround I can think of is to explicitly get all the docs to be indexed first, update them with the new data (for example, query click info) one by one, and then index the docs back into ES. This works, but performance is not good.

I am fine with this workaround in the query clicks pipeline, but I don't like it in the crawling pipeline, as we want crawling to be as fast as possible.

Is there any other workaround that lets us partially update a document while still making sure newer docs win if the same doc is updated at the same time (as we do now by using the last modified time as the external version)?

I hope I explained my problem clearly.


(Nik Everett) #2

You can still use the index API if you really want to. The update API is designed to save you some round trips and provides some nice things like retrying on a conflict.

Right. Your problem is that you have two external versions: the one from your CMS and the one from the update pipeline. Maybe you should combine them: make the CMS version the low 48 bits and the month the high 16 bits. Or give up on using external versions, use Elasticsearch internal versions, and use two fields in the document to represent the two versions (CMS version and click tracking version). You can't use the built-in conflict resolution, but it sounds like you could turn on retries on the updates and probably be OK there.
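
The bit-packing idea could look roughly like the sketch below. It is only an illustration under some assumptions that are not in the thread: the CMS time is taken as epoch milliseconds (which fits in 48 bits), the month is a small counter such as "months since some start date", and the top bit is left unused so the result stays a positive signed 64-bit value. All names are hypothetical.

```python
CMS_BITS = 48
CMS_MASK = (1 << CMS_BITS) - 1
# Use only 15 of the high 16 bits so the sign bit of a 64-bit long stays clear.
MAX_MONTH = (1 << 15) - 1


def pack_version(month_index: int, cms_millis: int) -> int:
    """Combine a month counter (high bits) and a CMS timestamp (low 48 bits)."""
    if not 0 <= month_index <= MAX_MONTH:
        raise ValueError("month_index out of range")
    if not 0 <= cms_millis <= CMS_MASK:
        raise ValueError("cms_millis does not fit in 48 bits")
    return (month_index << CMS_BITS) | cms_millis


def unpack_version(version: int) -> tuple[int, int]:
    """Split a packed version back into (month_index, cms_millis)."""
    return version >> CMS_BITS, version & CMS_MASK


packed = pack_version(3, 1486026610123)
print(packed, unpack_version(packed))
```

With this layout, any version from a later month compares greater than every version from an earlier month, and within one month the CMS time decides the order. Note that a number in the `yyyyMMddHHmmssfff` form is too large for 48 bits, which is why the sketch assumes epoch milliseconds instead.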


(Xudong You) #3

Thanks Nik.

To clarify, we have only one external version, which is the last modified time of the doc in the CMS. When indexing the doc in the crawling pipeline, we pass this version in the Index API URL.

And in the query clicks pipeline, we first get all docs to be updated and then:

  1. Get current external version of each doc from the response
  2. Update the query clicks info of each doc
  3. Set (current version + 1) as new version of each doc
  4. Index all docs
  5. Re-try if version conflicts happen

Note: the last modified time format is yyyyMMddHHmmssfff
Example: 20170202091010123 (2017-02-02.09:10:10.123)

So the + 1 in step 3 makes sure that the version of a doc modified later in the crawling pipeline is still greater than the version of the same doc after its query clicks were updated in the query clicks pipeline.

I can also use the same steps in the crawling pipeline, but I would like to know if there are better approaches.
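
To make the version arithmetic concrete, here is a small sketch of the `yyyyMMddHHmmssfff` encoding and the `+ 1` bump from step 3. The function names are hypothetical; only the format and the example value come from the post.

```python
from datetime import datetime


def to_external_version(modified: datetime) -> int:
    """Encode a last modified time as a yyyyMMddHHmmssfff integer."""
    millis = modified.microsecond // 1000
    return int(modified.strftime("%Y%m%d%H%M%S") + f"{millis:03d}")


def bump_for_click_update(current_version: int) -> int:
    """Step 3: the click pipeline re-indexes with current version + 1."""
    return current_version + 1


crawled = to_external_version(datetime(2017, 2, 2, 9, 10, 10, 123000))
after_clicks = bump_for_click_update(crawled)
later_edit = to_external_version(datetime(2017, 2, 2, 9, 10, 10, 125000))

print(crawled)                    # 20170202091010123
print(after_clicks < later_edit)  # True: a later CMS edit still wins
```

One edge case of this scheme: because external versioning requires a strictly greater version, a CMS edit whose encoded time equals the bumped value (for example an edit exactly one millisecond later) would be rejected as a conflict.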


(Nik Everett) #4

You have two things changing the documents, so you don't really have one external version. I mean, if you try to have one external version it isn't going to work very well. I think you are better off not using external versions at all, saving the CMS version in a field, and using that in an _update call to sync the CMS.


(Xudong You) #5

We already have a field that stores the "last modified time". In that case, how can I use the _update API in the crawling pipeline to make sure an older version of a doc won't overwrite a newer one?

The docs updated in the CMS are processed asynchronously via a distributed messaging system (Azure Event Hub, similar to Kafka) in our crawling pipeline, so it is possible that an older doc is processed and sent to the ES index earlier than the newer doc.


(Nik Everett) #6

For the pipeline I'd turn on retries on _update. That should be enough to
make sure concurrent updates don't fail.

For the CMS I'd check the last modified times when doing the _update and if
it is ahead of the time you are trying to write then turn the write into a
noop. There is documentation for how to do that on the _update page.


(Xudong You) #7

Thanks Nik.

Do you mean I can use a scripted update like the following example?

POST test/type1/1/_update
{
    "script" : {
        "inline": "if (params.lastModified > ctx._source.lastModified) { ctx._source.Title = params.Title; ctx._source.Description = params.Description; ctx._source.Url = params.Url; } else { ctx.op = \"none\" }",
        "lang": "painless",
        "params" : {
            "lastModified" : "[last modified time]",
            "Title": "new title value...",
            "Description": "new description value...",
            "Url": "http://example.com"
        }
    }
}

And wrap multiple scripted updates in the Bulk API?


(Nik Everett) #8

Yes, that. You can totally wrap it in _bulk, yeah.
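
For reference, wrapping several scripted updates into one `_bulk` body might look like the sketch below. It only builds the newline-delimited request body and does not talk to a cluster. Several things here are assumptions rather than details from the thread: `lastModified` is sent as a number so that `>` is a numeric comparison, the script also writes `params.lastModified` back into the document (the script in the post does not) so that later comparisons see the new time, and the `_type` key and `inline` script key follow the 5.x-era request format used above. The helper name and sample values are hypothetical.

```python
import json

# Guarded update: apply the new fields only if the incoming time is newer.
GUARD_SCRIPT = (
    "if (params.lastModified > ctx._source.lastModified) { "
    "ctx._source.Title = params.Title; "
    "ctx._source.Description = params.Description; "
    "ctx._source.Url = params.Url; "
    "ctx._source.lastModified = params.lastModified; "
    '} else { ctx.op = "none" }'
)


def build_bulk_body(index: str, doc_type: str, updates: dict) -> str:
    """Return an NDJSON _bulk body with one scripted update per document.

    `updates` maps a document id to the script params for that document.
    """
    lines = []
    for doc_id, params in updates.items():
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        body = {"script": {"inline": GUARD_SCRIPT, "lang": "painless", "params": params}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


print(build_bulk_body("test", "type1", {
    "1": {"lastModified": 20170202091010123, "Title": "new title value...",
          "Description": "new description value...", "Url": "http://example.com"},
}))
```

The resulting string would be sent as the body of a `POST _bulk` request.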


(Xudong You) #10

I compared the external version approach and the scripted update approach.

It seems that DELETE scenarios cannot be handled as well with scripted updates as with the external version approach.

Example:

  1. The CMS updated a doc at time t1 and sent the event to Event Hub
  2. The CMS deleted the same doc at time t2 (t2 > t1) and sent the event to Event Hub
  3. The delete event (step 2) was received by the crawler earlier than the update event (step 1)
  4. The crawler sent a DELETE doc request to Elasticsearch
  5. The doc was deleted from Elasticsearch
  6. The crawler sent an UPDATE doc request to Elasticsearch

With the external version approach (where, of course, the Index API is used in step 6), step 6 fails with a version conflict exception, which is exactly what we expect.

But with the scripted update approach, step 6 successfully inserts the doc as a newly added doc, which is not what we want.
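
The difference between the two approaches in this scenario can be reproduced with a toy in-memory model. This is not Elasticsearch code and all names are hypothetical. It assumes the scripted update is sent with an upsert, since a plain `_update` on a missing document fails instead of creating it, and it models a delete under external versioning as leaving the deleted version behind. In Elasticsearch that record is kept only for a limited time, controlled by the `index.gc_deletes` setting.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is not newer."""


class ExternalVersionModel:
    """Writes and deletes are accepted only with a strictly greater version."""

    def __init__(self):
        self.docs = {}
        self.last_version = {}  # kept even after a delete

    def _accept(self, doc_id, version):
        if version <= self.last_version.get(doc_id, -1):
            raise VersionConflict(f"{doc_id}: {version} is not newer")
        self.last_version[doc_id] = version

    def index(self, doc_id, source, version):
        self._accept(doc_id, version)
        self.docs[doc_id] = source

    def delete(self, doc_id, version):
        self._accept(doc_id, version)
        self.docs.pop(doc_id, None)


class ScriptedUpsertModel:
    """Guarded update with upsert: nothing is remembered after a delete."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, source):
        current = self.docs.get(doc_id)
        if current is None:
            self.docs[doc_id] = dict(source)
            return "created"
        if source["lastModified"] > current["lastModified"]:
            current.update(source)
            return "updated"
        return "noop"

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def replay_out_of_order(t1, t2):
    """Replay steps 1-6: the delete (t2) arrives before the update (t1)."""
    external = ExternalVersionModel()
    external.index("doc", {"lastModified": t1 - 1}, version=t1 - 1)
    external.delete("doc", version=t2)
    try:
        external.index("doc", {"lastModified": t1}, version=t1)
        external_result = "indexed"
    except VersionConflict:
        external_result = "conflict"

    scripted = ScriptedUpsertModel()
    scripted.upsert("doc", {"lastModified": t1 - 1})
    scripted.delete("doc")
    scripted_result = scripted.upsert("doc", {"lastModified": t1})
    return external_result, scripted_result


print(replay_out_of_order(100, 200))  # ('conflict', 'created')
```

In the model, the stale update is rejected under external versioning but recreates the document under the scripted upsert, because the guard has no stored `lastModified` left to compare against once the document is gone.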

Any workaround?

