Hello all,
I'm kinda new to Elasticsearch and I have a bit of a situation:
Every 3-4 days I receive a big file (around 30 million documents) to insert into Elasticsearch. I process the file line by line and index each line. The thing is, each new file is exactly like the last one but with a few changes (I don't know where the changes are). I receive the full file every time and rely on Elasticsearch to insert only the diffs. I do this by computing a hash based on a few fields, so if any of those fields change, the hash will be different. Then I use the hash as the id of the document. If a new file contains the same document as an older file, I insert it again and Elasticsearch just increments the _version field and that's all.
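To make the id scheme concrete, here is roughly what the hashing step looks like as a simplified Python sketch. The field names in KEY_FIELDS, the helper name document_id, and the choice of SHA-1 are placeholders for illustration only:

```python
import hashlib
import json

# Hypothetical field names; substitute the fields that actually identify a record.
KEY_FIELDS = ("name", "address", "status")


def document_id(record, key_fields=KEY_FIELDS):
    """Return a stable hex digest built from the chosen fields of one record."""
    # Serialize only the key fields, in a fixed order, so that two records
    # with the same key values always produce the same id.
    payload = json.dumps(
        [record.get(field) for field in key_fields],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

A record whose key fields are unchanged gets the same id as before, and any change in a key field produces a new id, which is what makes re-inserting the full file safe.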
The problem:
I need two fields in each document:
- first_seen: when this id (hash) was first indexed
- last_seen: updated every time this id (hash) is inserted again
How can I do this in Elasticsearch?
I've created an ingest pipeline:
PUT _ingest/pipeline/createddate
{
  "description": "add timestamp field to the document, requires a datetime field date mapping",
  "processors": [
    {
      "date": {
        "field": "last_seen",
        "formats": ["yyyy-MM-dd"],
        "target_field": "first_seen"
      }
    }
  ]
}
that takes the "last_seen" value from the incoming document and copies it into "first_seen", BUT I want this to happen only for new IDs (new hashes).
This one applies every time, so when I insert the same id again, both dates (last_seen and first_seen) get updated.
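One direction I'm wondering about is sending bulk "update" actions with a "doc" plus "upsert" body instead of plain index actions, since as far as I understand the "upsert" part is only used when the id does not exist yet. Below is a minimal, untested Python sketch of how I would build those bulk lines. The helper name bulk_upsert_lines and the index name "records" are made up for illustration, and I don't know how this interacts with the ingest pipeline above:

```python
import json


def bulk_upsert_lines(index, doc_id, fields, seen_date):
    """Build the two NDJSON lines of one bulk "update" action."""
    action = {"update": {"_index": index, "_id": doc_id}}
    body = {
        # Merged into the stored document when the id already exists;
        # first_seen is not mentioned here, so it is left untouched.
        "doc": {**fields, "last_seen": seen_date},
        # Indexed as a new document when the id is not there yet;
        # both dates start at the same value.
        "upsert": {**fields, "first_seen": seen_date, "last_seen": seen_date},
    }
    return [json.dumps(action), json.dumps(body)]


# Example: one record from today's file (hypothetical values).
lines = bulk_upsert_lines("records", "abc123", {"name": "x"}, "2024-01-15")
```

Joining the lines for all records with newlines (plus a trailing newline) would give the body for a _bulk request.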
Hope I made myself clear and that someone can help me.