Update_by_query versus terms performance

There is a use case I am working on as described below:

I have documents(transactions) getting saved in below format:

String date
String versionId // UUID
// OTHER FIELDS

Each date has multiple version transactions that goes up to few millions in each version(1-100mil)

Problem statement is to fetch only active transactions(with pagination)

  • for a given date
  • for a specified date range
  • for all dates(no date filter)

I have narrowed down my design to below options:

Option 1:
Add a new field active to maintain the latest version.

String date
String versionId
boolean active // Gives me active docs with value set to true
// OTHER FIELDS

For every date, I can have multiple versions with only latest versionId transactions set active to true. Whenever a new version is being indexed, I am reverting the older versionId transactions active to false using _update_by_query.

How update happens?

  • _update_by_query(wait_for_completion=false) with active = true and set active = false using script.
  • Wait for above task to complete.
  • Insert new documents(1-100million docs) with next versionId and active = true in transactions index.

How search happens?
Now, search happens directly on the index(transactions) with active = true for all cases:

  • for a given date
  • for a specified date range
  • for all dates(no date filter)

Pagination is through search_after.

Questions:

  • Is _update_by_query efficient here?

Option 2:
Introduce a new index(transactions_meta) for meta store with below format:

String date
String versionId
boolean active

This meta index will have very less data. Each date and versionId will have one document as opposed to many documents in transactions index.

How update happens?

  • _update_by_query with active = true and set active = false using script in transactions_meta index.
  • Insert single document with next versionId and active = true in the transactions_meta index.
  • Insert new documents(1-100million docs) in same versionId(from above) in transactions index.

How search happens?
Now, search is a two step process:

  • Search all the documents with active = true in transactions_meta index for multiple dates(using terms or range). Get all documents using search_after.
  • Then Get unique versionIds from above step.
  • Search transactions index using terms filter having versionIds from above. Pagination is through search_after.

Questions:

  • Is terms filter efficient here? There will be more than 200 terms values.

Which option is better? I am inclined towards option 1.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.