Refresh_interval for denormalized views with high indexing load


(Дмитрий Пахомов) #1

Hi!
I am trying to create a materialized view of relational data in Elasticsearch. Let's say I have a child table and multiple parent tables (all backed by separate microservices with event queues), and I mostly need to find the relevant children by their parents' properties.
So my model is denormalized like:

PUT matview/foo/1
{
  "foo": "bar",
  "parent1": { "id": "1", ... },
  "parent2": { "id": "2", ... }
}

When a parent1 update event is accepted, I perform an _update_by_query to update the parent1 objects by id:

POST matview/_update_by_query
{
  "query": { "term": { "parent1.id": "1" } },
  "script": { "source": "ctx._source.parent1.name='foo'" }
}
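As a side note, if it is acceptable to skip conflicted documents and sweep them up in a follow-up run, _update_by_query accepts a conflicts=proceed parameter that counts version conflicts instead of aborting the whole request on the first one:

```
POST matview/_update_by_query?conflicts=proceed
{
  "query": { "term": { "parent1.id": "1" } },
  "script": { "source": "ctx._source.parent1.name='foo'" }
}
```

The response then reports the number of conflicts in its version_conflicts counter, and the caller decides whether to re-run.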

The problem is that I am constantly getting version conflict errors on _update_by_query and _delete_by_query. The preferred way to avoid those is to use refresh=wait_for plus client-side retries, as discussed here:
github issue
and here:
?refresh doc
So I think I should decrease refresh_interval in this case, to reduce the probability of version conflicts and the retries that follow.
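For reference, the client-side retry part can be sketched roughly like this. This is only an illustration: retry_on_conflict, ConflictError, and flaky_update are hypothetical names standing in for your client call and whatever exception it raises on a 409 version conflict.

```python
import random
import time

class ConflictError(Exception):
    """Placeholder for the 409 version-conflict error your client raises."""

def retry_on_conflict(operation, max_retries=5, base_delay=0.05):
    """Run `operation`, retrying with jittered exponential backoff on conflicts."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ConflictError:
            if attempt == max_retries - 1:
                raise
            # Back off so a refresh has a chance to happen before the retry.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Example: a simulated operation that conflicts twice before succeeding.
calls = {"n": 0}
def flaky_update():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConflictError()
    return {"updated": 1}

result = retry_on_conflict(flaky_update)
```

The longer refresh_interval is, the longer the backoff has to be for retries to have a chance, which is exactly the tension described above.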

On the other hand, my index is under heavy indexing load, and as described here:
tune for indexing speed doc
I should increase refresh_interval to increase indexing speed.
So I can't use refresh=wait_for, because it could wait too long with a big refresh_interval value.

The question is: what should I do, increase or decrease? -_- Or maybe there is some other way to overcome this problem?
For example, maybe there is some hack to disable versioning (I know I can use external versioning when updating by id, but there seems to be no way to use external versioning with _update_by_query).
As I said before, I use queues to populate Elasticsearch, so I have full control over the index (shard) update order.


(Loren Siebert) #2

I don't think any combination of refresh_interval and refresh settings is going to address your problem, and even if it did, you would have another problem with performance, because you would be creating lots of tiny Lucene segments with all those updates to the same documents.

Consider collecting all your parent.id → name updates in memory (e.g., {"1": "foo", ...}) and then applying them all in a single bulk operation (ideally via _bulk) after all your PUT matview/foo/.. requests are done and the index has been refreshed.

Since you are using queues, you can also change the unit of work from one parent to a group of, say, 100 parents to make the bulk operations more efficient (i.e., reduce small-segment creation). Hope this helps!
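A minimal sketch of that batching idea, under the assumption that you already know which child document ids each parent update touches (build_bulk_body and the dict shape are made up for illustration; the "update"/"doc" action format itself is the standard _bulk partial-update syntax). It only builds the NDJSON payload, which you would then POST to _bulk:

```python
import json

def build_bulk_body(index, updates):
    """Build an NDJSON _bulk payload of partial-document updates.

    `updates` maps child document id -> partial source to merge,
    e.g. {"1": {"parent1": {"name": "foo"}}}.
    """
    lines = []
    for doc_id, partial in updates.items():
        # Action line: partial update of an existing document.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: the fields to merge into _source.
        lines.append(json.dumps({"doc": partial}))
    # _bulk bodies are newline-delimited JSON and must end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body("matview", {"1": {"parent1": {"name": "foo"}}})
```

One request then carries all the accumulated parent changes, instead of one _update_by_query per event.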


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.