Latest Transform questions

When creating a latest transform through the API, I need to provide unique_key, which is a list of field names.

Some questions:

  1. How do I reference a field in a sub-object? For example, foo contains a key bar that needs to be part of my uniqueness set.
  2. Is there a way to select all fields EXCEPT a given list (an exclude list)?
  3. Can two transforms share a destination index?
  4. Is there a way to specify ALL fields in a sub-object, like foo.*?
  5. How do you specify fields within an array, where foo is an array of objects with bar inside?

All of these are related in my mind. I need 1 and would like to use 2; failing that, I need 3 or 4+.

Thanks in Advance

#1 solved, it accepts dot syntax, so foo.bar finds my field.
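
For anyone finding this later, here is a minimal sketch of what that looks like in a preview request. The index names my_source and my_dest and the sort field timestamp are placeholders I made up for illustration; only foo.bar is the actual point.

```console
GET /_transform/_preview?filter_path=preview
{
  "source": { "index": "my_source" },
  "dest": { "index": "my_dest" },
  "latest": {
    "unique_key": ["foo.bar"],
    "sort": "timestamp"
  }
}
```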

Where do you want to exclude fields? As far as I know, a latest transform has no setting for selecting fields.
That said, you could get a similar result with a remove processor in an ingest pipeline on the dest index.
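
A rough sketch of that idea, in case it helps. The pipeline name drop_timestamp, the transform ID my_latest_transform, the index names, and the field names are all hypothetical placeholders. Note that this only drops a field from the documents written to the dest index; it does not change which fields make up unique_key.

```console
PUT /_ingest/pipeline/drop_timestamp
{
  "processors": [
    { "remove": { "field": "timestamp", "ignore_missing": true } }
  ]
}

PUT /_transform/my_latest_transform
{
  "source": { "index": "my_source" },
  "dest": { "index": "my_dest", "pipeline": "drop_timestamp" },
  "latest": {
    "unique_key": ["foo.bar"],
    "sort": "timestamp"
  }
}
```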

Yes, but if the same unique_key values occur in both source indices, documents in the dest index could be overwritten with data from the older index. I would recommend a single transform that uses an index pattern or an array of indices as its source instead.
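
As a sketch, a single transform that reads from several indices could look like the preview below. The index names and field names are placeholders, not taken from your setup; a pattern such as "my_source_*" in place of the array should work the same way.

```console
GET /_transform/_preview?filter_path=preview
{
  "source": { "index": ["my_source_a", "my_source_b"] },
  "dest": { "index": "my_dest" },
  "latest": {
    "unique_key": ["foo.bar"],
    "sort": "timestamp"
  }
}
```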

Thanks. I want all fields except one in unique_key. So for everything but timestamp, I want to express this:

"unique_key": "!timestamp"

I did some testing and found a TRICK. (Remember, it is just a trick.)

Here is the setup.

PUT /test_unique_key/
{
  "mappings": {
    "properties": {
      "id":{"type":"keyword"},
      "date":{"type":"date"}
    }
  }
}

POST /test_unique_key/_bulk
{"index":{}}
{"id":1,"date":"2011-10-10T00:00:00.000Z"}
{"index":{}}
{"id":1,"date":"2020-10-10T00:00:00.000Z"}

If you set "timestamp" ("date" here) explicitly in addition to the wildcard:

GET /_transform/_preview?filter_path=preview
{
  "latest":{
    "sort":"date",
    "unique_key": ["*","date"]
  },
  "dest": "test_dest",
  "source":{
    "index":"test_unique_key"
  }
}

The two documents are kept distinct.

{
  "preview" : [
    {
      "date" : "2011-10-10T00:00:00.000Z",
      "id" : 1
    },
    {
      "date" : "2020-10-10T00:00:00.000Z",
      "id" : 1
    }
  ]
}

If you set only a wildcard, which also matches "timestamp":

GET /_transform/_preview?filter_path=preview
{
  "latest":{
    "sort":"date",
    "unique_key": ["*"]
  },
  "dest": "test_dest",
  "source":{
    "index":"test_unique_key"
  }
}

The difference between the "timestamp" field values is ignored and there is only one document.

{
  "preview" : [
    {
      "date" : "2020-10-10T00:00:00.000Z",
      "id" : 1
    }
  ]
}

This could be a solution. However, this behavior is not explained in the official documentation, so I'm not sure whether it works in general or only in this specific case (it is unlikely, but it could work only with the field name "date", for example). It could also disappear or change suddenly without any notice.

If you want to use this, please do thorough testing and use with extreme caution.

IMO, if you want to deduplicate identical documents, using an ingest pipeline with the fingerprint processor at index time could be a better solution than a latest transform.
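
A minimal sketch of the fingerprint approach, reusing the test documents from above. The pipeline name dedupe_by_fingerprint and the index name test_fingerprint_dedup are made up for illustration, and I have not run this exact request, so treat it as a starting point. The idea is to hash the fields that define uniqueness and use the hash as the document _id, so a later document with the same fingerprint overwrites the earlier one.

```console
PUT /_ingest/pipeline/dedupe_by_fingerprint
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["id"],
        "target_field": "fingerprint"
      }
    },
    {
      "set": {
        "field": "_id",
        "value": "{{{fingerprint}}}"
      }
    }
  ]
}

POST /test_fingerprint_dedup/_bulk?pipeline=dedupe_by_fingerprint
{"index":{}}
{"id":1,"date":"2011-10-10T00:00:00.000Z"}
{"index":{}}
{"id":1,"date":"2020-10-10T00:00:00.000Z"}
```

To fingerprint a whole sub-object, you could list the object field itself, for example "fields": ["foo"].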

This solves my problem, I think. Not the "all fields except" uniqueness problem exactly, but it lets me get around it by solving 4 above. All fields in sub-object foo are the equivalent of foo's fingerprint. Thanks
