Mapping transform: does it run only when creating documents, not when updating?


(Sven Beauprez) #1

I tested the following mapping transform script:

"mappings": {
  "tag": {
    "_timestamp": {
      "enabled": true
    },
    "transform": {
      "script": "ctx._source.counter = ctx._source.counter == null ? 1 : 2",
      "lang": "groovy"
    },
    ...
    "counter": {
      "type": "long",
      "index": "not_analyzed",
      "store": true,
      "doc_values": true
    },
    ...

It worked fine when indexing a new document: the counter field is set to 1 for each new entry added.

When I update or overwrite an existing entry, the script no longer seems to be executed. In other words, the counter is not set to 2. Is this the expected behavior?

I would like to avoid using scripting inside the update API request (the client should not need to know about it), but still have some modifications applied to documents that are being updated in ES.

I also thought I would be able to check with ctx.op if a document is created or updated...

Any ideas?

regards,

Sven


(Sven Beauprez) #2

Just to add some context, maybe @nik9000 or @Mark_Harwood have some ideas:

I am using the entity-centric approach to store processed information in a second index, based on log data that is stored 'as is' (raw) in another index.

Data comes in via Logstash, which uses bulk upload towards both indexes (another argument why scripts in updates are not an option: they are not supported). A unique id is used to update data in the second index, but some processing needs to be done before it can be stored (similar to computing session duration for web logs).

I would like to avoid any external processing (i.e. outside ES, such as in Python) that takes data from the first index, processes it, and stores it in the second index, as shown in Mark's presentations. It seems a mapping transform should work in my simple case, but I got stuck with the above.

regards,

Sven


(Nik Everett) #3

That's really not the point of transforms. They are supposed to be super copy_tos. The point is that the data from the transform shouldn't be in the _source, and it's a bug if it is.

Scripts are supported on bulk update. I dunno if logstash supports them - it should.

Transform is for transforming. The update script is for updating. I think the update script is more right here.
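
A minimal sketch of what a scripted update sent through the bulk API might look like, assuming the ES 1.x request format and that inline Groovy scripting is enabled on the cluster. The index, type and id values, the helper name bulk_scripted_update, and the upsert document are illustrative assumptions. Whether Logstash can emit such a request is not confirmed here.

```python
import json


def bulk_scripted_update(index, doc_type, doc_id, fields):
    """Build the two NDJSON lines of one scripted update for the _bulk endpoint.

    `fields` is the document as the client would send it (hypothetical input).
    """
    # Action line: tells _bulk which document to update.
    action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
    # Body line: the script runs when the document already exists,
    # the upsert document is indexed when it does not.
    body = {
        "script": "ctx._source.counter += 1",
        "lang": "groovy",
        "upsert": dict(fields, counter=1),
    }
    # Every line of a _bulk payload, including the last, ends with a newline.
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"


payload = bulk_scripted_update(
    "transform", "simple", "1",
    {"title": "This is a document with text", "description": "null"},
)
print(payload, end="")
```

In this sketch the script only increments the counter. Copying other changed fields into the existing document would need extra script logic or script params, which is left out here.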


(Sven Beauprez) #4

@nik9000 I think I made a mistake in choosing my wording: I really do not want to change the _source itself. I want to use the updated data, do some processing and 'add' that result to be indexed.

For example, an entry is created with a timestamp, then an update arrives with a new timestamp, and I want to keep the duration between the log statements, in this case the difference between the last and the first timestamp. The _source, which is updated, has the latest timestamp, which is exactly what I want. As an extra, I have a computed field that contains the duration (diff).

Does that suit the case of mapping transform or am I stretching things?


(Nik Everett) #5

If you want it to come back in the _source then it's not going to work - if you just want to be able to search for it then it's fine. I think if you want a diff you probably want it in the _source though, so I think I'd go with the update script. Transform is really for situations like "I want to copy my text field to my suggest field but only if my namespace field is 0".


(Sven Beauprez) #6

Ok, got it.

Just to come back to the original question for completeness and future reference: it only works when creating new documents, not with updates of existing documents, am I right?


(Nik Everett) #7

It's certainly supposed to work for updates.


(Sven Beauprez) #8

I ran the following simple test in both ES 1.6 and 1.7, and the counter was only set when the document was created:

{
  "mappings": {
    "simple": {
      "transform": {
        "lang": "groovy",
        "script": "ctx._source.counter = ctx._source.counter == null ? 1 : ctx._source.counter + 1"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "counter": {
          "type": "long",
          "store": "yes"
        }
      }
    }
  }
}

Then:

PUT
{ "title":"This is a document with text", "description":"null" }

When getting the document, the counter is correctly set to 1.

When overwriting the document (PUT on the same URL, not with _update), the counter is still 1 while the version has clearly increased (GET with _source_transform):

{
  "_index": "transform",
  "_type": "simple",
  "_id": "1",
  "_version": 3,
  "found": true,
  "_source": {
    "title": "This is a document with text",
    "description": "null",
    "counter": 1
  }
}

The same happens when I update via POST to the _update URL:

POST .../_update
{ "doc" : { "title":"This is a document with text", "description":"null" }}

Result (GET with _source_transform):

{
  "_index": "transform",
  "_type": "simple",
  "_id": "1",
  "_version": 4,
  "found": true,
  "_source": {
    "title": "This is a document with text",
    "description": "null",
    "counter": 1
  }
}

Am I making a stupid mistake here?


(Nik Everett) #9

OK. I had a think about this. The reason this happens is that changes from transform aren't saved to the source. They aren't supposed to be. They are for when you want to index stuff that doesn't match the source. Just like copy_to.
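
The behavior described here can be illustrated with a toy model in Python. This is not Elasticsearch code: ToyIndex, put and partial_update are hypothetical names, and the model only assumes what is stated above, namely that the transform runs on a copy of the document at index time and its changes are never written back to the stored _source.

```python
import copy


def transform(source):
    """Python stand-in for the Groovy transform script from the test mapping."""
    counter = source.get("counter")
    source["counter"] = 1 if counter is None else counter + 1


class ToyIndex:
    """Keeps the client's _source and the transformed, indexed view separately."""

    def __init__(self):
        self.stored_source = {}
        self.indexed_view = {}

    def put(self, doc_id, source):
        # The stored _source is exactly what the client sent.
        self.stored_source[doc_id] = copy.deepcopy(source)
        # The transform runs on a copy; its changes are indexed but never saved.
        view = copy.deepcopy(source)
        transform(view)
        self.indexed_view[doc_id] = view

    def partial_update(self, doc_id, partial_doc):
        # A partial update merges into the stored _source, then re-indexes.
        merged = dict(self.stored_source[doc_id], **partial_doc)
        self.put(doc_id, merged)


index = ToyIndex()
doc = {"title": "This is a document with text", "description": "null"}
index.put("1", doc)             # create
index.put("1", doc)             # overwrite
index.partial_update("1", doc)  # update with a partial document

print(index.indexed_view["1"]["counter"])     # 1
print("counter" in index.stored_source["1"])  # False
```

Because the stored _source never contains the counter, the script sees null on every write and produces 1 each time, which matches the results in the test above.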


(Sven Beauprez) #10

IMO this is also true for updates. Do I need to open a GitHub issue to discuss this?

Anyway, it is clear now what works and what doesn't. Thanks for your help!

