Transform is only partially updated

Hi everyone,

I have a transform that is supposed to track the latest doc of some index.
The transform is based on a field called "etl_id", and for some reason it is updated only for a few "etl_id" values but not all of them. Here is an example of a doc that was not updated in the transform:

source index (etl_logs):

{
        "_index": "etl_logs",
        "_id": "C1FKPI0BOAhJ8yM6psQ7",
        "_score": null,
        "_source": {
          "level": "INFO",
          "host": "host1",
          "etl_filename": "etl.py",
          "etl_id": "1988",
          "items_processed": 1,
          "log_data": "",
          "timestamp": "2024-01-24T18:25:41+03:00"
        }
}

transform index (latest_etl_log):

{
        "_index": "latest_etl_log",
        "_id": "MYWbD3aM_AEJlF6WStE10ckAAAAAAAAA",
        "_score": null,
        "_source": {
          "items_processed": 1,
          "etl_id": "1988",
          "level": "INFO",
          "etl_filename": "etl.py",
          "host": "host1",
          "log_data": "",
          "timestamp": "2024-01-22T16:52:57+03:00"
        }
}

And here are the transform settings:

{
  "count": 1,
  "transforms": [
    {
      "id": "latest_etl_log",
      "authorization": {
        "roles": [
          "superuser"
        ]
      },
      "version": "8.7.1",
      "create_time": 1705401981779,
      "source": {
        "index": [
          "etl_logs"
        ],
        "query": {
          "match_all": {}
        }
      },
      "dest": {
        "index": "latest_etl_log"
      },
      "sync": {
        "time": {
          "field": "timestamp",
          "delay": "60s"
        }
      },
      "latest": {
        "unique_key": [
          "etl_id.keyword"
        ],
        "sort": "timestamp"
      },
      "settings": {}
    }
  ]
}

Any idea what the problem could be? Why are some of the etl_id values updated correctly while the others stay behind?

Hi @Doron_Abramovich,

Can you please try changing your transform configuration and use

"unique_key": [
          "etl_id"
        ],

instead?

That being said, I'm not sure I fully understand your problem:
are you missing the latest docs for some etl_id values entirely, or do you have docs in your destination index for every etl_id, but these docs are not the latest ones?

The example you provided was describing the latter, and it's a different problem overall.

@greco
Yeah, sorry, it was not so clear.

I have all the needed etl_id values in the destination index, but only some of them keep updating. The example I gave was a document that was supposed to be updated according to the source index (with the date 2024-01-24), whereas the destination index still holds an older document after the transform ran (with the date 2024-01-22).

I will try your solution anyhow and let you know if it works out :slight_smile:

Edit:
When trying to use etl_id instead of etl_id.keyword, I get an error message:

Fielddata is disabled on [etl_id] in [etl_logs]. Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default.

Is there another solution you can think of?
Do you have an idea of what might be the problem?

Ok, thanks for clarifying.

The only reason I can think of for these missing docs would be a transform that struggles to catch up for some values of etl_id: do you have a lot of documents?

Are your transforms lagging?

There are only 45 distinct etl_id values, that's pretty solid, I guess.
The etl_id can be either a number or a string; for example, "1424" and "FTP_daily_process" are two different values.

The field mapping in the source index etl_logs is:

"etl_id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
       }

I have now enabled fielddata on etl_id and use etl_id instead of etl_id.keyword.

I will let you know soon if it works out.

Update:
The problem still happens. My other transforms work properly; it is just this one.

Thank you for trying all this.
We need a bit more time to investigate further.

After a quick check of the transform messages, I can see that I do get a warning message:
Non-empty destination index [latest_etl_log]. Contains [70] total documents.

Thank you!
I appreciate your help :slight_smile:

One more reason for missing updates I can think of is that when the document is ingested into the source index, its timestamp is already "old", i.e. more than 60s (the configured delay) behind the actual server time.
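
To make that reasoning concrete, here is a minimal Python sketch (not Elasticsearch code) of how a late document can slip past a continuous transform. It assumes a simplified model in which each checkpoint only looks at documents whose sync field falls between the previous checkpoint's upper bound and "now minus delay". The function name, the sample times and the sample documents are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

DELAY = timedelta(seconds=60)  # mirrors "delay": "60s" in the transform config


def docs_seen_by_checkpoint(docs, previous_upper_bound, now, delay=DELAY):
    """Return the docs a checkpoint would notice in this simplified model.

    Assumption: a checkpoint only considers documents whose sync-field value
    lies in the window (previous_upper_bound, now - delay].
    """
    upper_bound = now - delay
    seen = [d for d in docs if previous_upper_bound < d["timestamp"] <= upper_bound]
    return seen, upper_bound


utc = timezone.utc
previous_upper_bound = datetime(2024, 1, 24, 15, 0, 0, tzinfo=utc)
now = datetime(2024, 1, 24, 15, 5, 0, tzinfo=utc)

docs = [
    # Indexed shortly after the event: its timestamp falls inside the window.
    {"etl_id": "1424", "timestamp": datetime(2024, 1, 24, 15, 2, 0, tzinfo=utc)},
    # Indexed late: its timestamp is already behind the previous checkpoint.
    {"etl_id": "1988", "timestamp": datetime(2024, 1, 24, 14, 50, 0, tzinfo=utc)},
]

seen, upper_bound = docs_seen_by_checkpoint(docs, previous_upper_bound, now)
seen_ids = [d["etl_id"] for d in seen]
print(seen_ids)  # ['1424']
```

In this model the late document for "1988" is never picked up by a later checkpoint either, because every following window starts even later.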
Usually the solution for that is to have an ingest pipeline on the source index that populates the event.ingested field for every source document, and then to use the event.ingested field in the sync.time section of the transform config.
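
As a rough sketch of that setup, the snippet below builds the relevant request bodies as Python dicts and prints them as JSON. The pipeline name add_event_ingested is hypothetical, and this is not necessarily the same pipeline as the one referred to in this thread. It relies on the standard set processor, the _ingest.timestamp metadata field and the index.default_pipeline setting.

```python
import json

# Hypothetical pipeline name; pick whatever fits your naming scheme.
PIPELINE_NAME = "add_event_ingested"

# Body for: PUT _ingest/pipeline/add_event_ingested
# The set processor copies the ingest timestamp into event.ingested.
pipeline_body = {
    "description": "Stamp every document with the time it was ingested",
    "processors": [
        {"set": {"field": "event.ingested", "value": "{{{_ingest.timestamp}}}"}}
    ],
}

# Body for: PUT etl_logs/_settings
# Makes the pipeline run for documents indexed into the source index.
index_settings_body = {"index": {"default_pipeline": PIPELINE_NAME}}

# What the "sync" section of the transform config would look like afterwards.
sync_section = {"time": {"field": "event.ingested", "delay": "60s"}}

for body in (pipeline_body, index_settings_body, sync_section):
    print(json.dumps(body, indent=2))
```

Documents indexed before the pipeline was attached will not have an event.ingested value, so this only helps for documents ingested from that point on.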

You can find the pipeline code here:

This is actually a good point. As I remember, I struggled with using the server timestamp for something because it did not match my country's timezone.

The only question is: how come this problem happens only for a certain index?
Maybe it happens because the time field in this index is named "timestamp", which is the same name the server uses?
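
Regarding the timezone concern mentioned above: a timestamp with an explicit offset such as +03:00 still denotes one absolute instant, so the offset alone should not make a document look "old". A small Python check using the timestamp from the example source document follows. It only illustrates the offset arithmetic and says nothing about how late the ETL script actually writes its log entries.

```python
from datetime import datetime, timezone

# Timestamp of the source document shown earlier in the thread.
local = datetime.fromisoformat("2024-01-24T18:25:41+03:00")

# Converting to UTC changes the representation, not the instant.
as_utc = local.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2024-01-24T15:25:41+00:00

# Both values compare equal because they are the same point in time.
print(local == as_utc)  # True
```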