Delete data from source index of a transform that created the documents in destination index

I have two indexes and a transform that aggregates multiple documents into one, which works fine. The transform reads from index 1 and writes to index 2:

  1. intermediate_index - source
  2. total_index - destination

Two fields define uniqueness in index 2:

"pivot": {
  "group_by": {
    "fullName": {
      "terms": {
        "field": "fullName"
      }
    },
    "operationId": {
      "terms": {
        "field": "operationId"
      }
    }
  },
  "aggregations": {
    // Some aggs here
  }
}

I would like to delete the documents from index 1 that were used to create the documents in index 2, in order to keep index 1 under a certain size as more data keeps being indexed into it.
As Elasticsearch has no joins, I thought of using the transform checkpoints to see which data the transform has already processed.
Does anybody know how to access the checkpoints of a continuous transform?
I am currently using version 7.16.3.

Hi,
In the pivot section of your transform config I can't see any date_histogram group-by field. That means the old data in the source index (intermediate_index) is still needed so that the transform can update the documents in the destination index (total_index) correctly when new data with the same fullName and operationId arrives.
If you add a date_histogram group-by on some timestamp field, then you could assume the old data will become irrelevant and can be deleted.
One way to delete this old data would be to use retention_policy (see docs) on the transform that produces intermediate_index (because retention_policy removes documents from the destination index).

Please let me know if that makes sense for your use-case.
I don't think checkpoints should be used directly by end users, as they are internal to the transform (an implementation detail).
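
To make the two suggestions above concrete, here is a minimal sketch in Python that only builds the relevant request fragments. The timestamp field creationDate comes from the sync block quoted later in this thread; the group-by name "day", the 1d interval and the 30d max_age are assumptions chosen for illustration, and the aggregations stay elided as in the question.

```python
import json

# Pivot from the question, plus a daily bucket on a timestamp field.
# The "day" group-by name and the "1d" interval are assumptions.
pivot = {
    "group_by": {
        "fullName": {"terms": {"field": "fullName"}},
        "operationId": {"terms": {"field": "operationId"}},
        "day": {
            "date_histogram": {"field": "creationDate", "calendar_interval": "1d"}
        },
    },
    # Placeholder: the real aggregations were elided in the question.
    "aggregations": {},
}

# retention_policy is part of a transform config and deletes from that
# transform's *destination* index, so it only helps for intermediate_index
# if that index is itself written by a transform. "30d" is an assumed value.
retention_policy = {"time": {"field": "creationDate", "max_age": "30d"}}

print(json.dumps({"pivot": pivot, "retention_policy": retention_policy}, indent=2))
```

With the extra date_histogram group-by, each destination document covers one fullName/operationId pair per day, so source documents older than the current bucket are no longer needed to recompute it.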


To add to what I said in my last post:
There is also the option of using ILM (index lifecycle management) on the intermediate index. Read the docs to find out more.
In short, ILM lets you specify a policy that determines which data will be removed, and when, for a particular index (intermediate_index in your case).
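
As a rough illustration of the ILM option, the sketch below builds a policy body with a single delete phase. The 30d min_age is an assumed value, and the body would be sent with PUT _ilm/policy/<policy_name> under a name of your choosing. Note that the delete action removes a whole index once it reaches min_age, so this fits best when the intermediate data is split into time-based or rolled-over indices.

```python
import json

def ilm_delete_policy(min_age="30d"):
    """Return an ILM policy body whose only phase deletes the index.

    min_age is how old the index must be before the delete phase runs;
    the default of "30d" is an assumption for illustration.
    """
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": min_age,
                    "actions": {"delete": {}},
                }
            }
        }
    }

print(json.dumps(ilm_delete_policy(), indent=2))
```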

First things first,
thank you for the response :)
The intermediate_index is populated via bulk indexing, not by a transform, and the reason for deleting is to keep the indexes under a certain size,
so I guess the suggestion makes less sense... :(


ILM seems less appropriate: per the docs it has a limited list of actions, and it won't do a "filtered delete" as a lifecycle action.

As for date_histogram, the transform currently in use also has:

"sync": {
  "time": {
    "field": "creationDate",
    "delay": "60s"
  }
},

in the transform definition.
But it may happen that some documents cannot be transformed yet because the data is incomplete: say four types of documents are required (distinguished by the value of another field, "phaseName", e.g. a, b, c and d), but only a, b and c arrive, and d may arrive as much as a day later.
So the documents with a, b and c should "wait" until d arrives, which means the date is only a partial help.
From the _transform/<transform_id>/_stats API we have:

"checkpointing": {
  "last": {
    "checkpoint": 3,
    "timestamp_millis": 1653227760898,
    "time_upper_bound_millis": 1653227700898
  },
  "changes_last_detected_at": 1653227760895,
  "last_search_time": 1653303961088
}

For now, though, the options I have in mind narrow down to: _preview followed by a delete, or _delete_by_query driven by the documents created in total_index and the timestamp given in time_upper_bound_millis (per the transform stats docs).
Any ideas?
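
A rough sketch of the _delete_by_query idea above, in Python, under several assumptions: fullName and operationId are keyword fields, the required phaseName values are exactly a, b, c and d, and documents with creationDate before time_upper_bound_millis have already been seen by the transform. The helper names are hypothetical, and the code only builds the request body that would be POSTed to intermediate_index/_delete_by_query; it does not call Elasticsearch. As noted earlier in the thread, checkpoint details are internal to the transform, and without a date_histogram group-by a destination document is recomputed from whatever source documents remain when new data arrives for the same key.

```python
REQUIRED_PHASES = frozenset({"a", "b", "c", "d"})  # assumed phaseName values

def complete_groups(docs, required=REQUIRED_PHASES):
    """Return the (fullName, operationId) pairs that have every required phase."""
    seen = {}
    for doc in docs:
        key = (doc["fullName"], doc["operationId"])
        seen.setdefault(key, set()).add(doc["phaseName"])
    return sorted(key for key, phases in seen.items() if required <= phases)

def delete_by_query_body(groups, time_upper_bound_millis):
    """Build a _delete_by_query body for the given groups, limited to
    documents whose creationDate is before the checkpoint upper bound."""
    if not groups:
        # Refuse to build a query that would be restricted by time alone.
        raise ValueError("no complete groups to delete")
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            "creationDate": {
                                "lt": time_upper_bound_millis,
                                "format": "epoch_millis",
                            }
                        }
                    }
                ],
                # One clause per group; at least one must match.
                "should": [
                    {
                        "bool": {
                            "filter": [
                                {"term": {"fullName": full_name}},
                                {"term": {"operationId": operation_id}},
                            ]
                        }
                    }
                    for full_name, operation_id in groups
                ],
                "minimum_should_match": 1,
            }
        }
    }

# Made-up sample documents; the upper bound is the value from the stats above.
sample = [
    {"fullName": "alice", "operationId": "op1", "phaseName": p} for p in "abcd"
] + [
    {"fullName": "bob", "operationId": "op2", "phaseName": p} for p in "abc"
]
body = delete_by_query_body(complete_groups(sample), 1653227700898)
```

Here only the alice/op1 group is selected for deletion; bob/op2 is still missing phase d, so its documents stay in the source index until d arrives.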
