Ingest Attachment Plugin update index

fribeiro-keep · November 16, 2021, 4:11pm

Hey there,

I'm trying to use the Ingest Attachment plugin on Elasticsearch 6.8.

My goal is to add the extracted document text to documents that already exist in an index.

This is a sample document from my current index:

{
"_index": "koha_biblios",
"_type": "data",
"_id": "7289",
"_version": 1,
"_seq_no": 1390,
"_primary_term": 1,
"found": true,
"_source": {
"title": [
"SHIP PERFORMANCE"
],
"author": [
"HUGHES, C. N."
],
"itype": [
"MON"
]
}
}

First, I add a new mapping to the existing index:

PUT koha_biblios
{ 
    "mappings" : { 
        "data" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text", 
                    "analyzer" : "analyzer_standard" 
                } 
            } 
        } 
    } 
}

Then I create the pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

Then, when I submit the file for ingestion:

PUT koha_biblios/data/7289?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

I lose all the existing fields and get only the attachment data:

{
"_index": "koha_biblios",
"_type": "data",
"_id": "7289",
"_version": 3,
"_seq_no": 11142,
"_primary_term": 1,
"found": true,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
}
}
}

Is it possible to add the ingested attachment to the existing document instead of replacing the whole document?
What am I missing?

Best Regards,
Filipe

dadoonet · November 16, 2021, 7:42pm

This behavior is expected: you are calling the index API, which indexes a new version of the document and overwrites the previous one, if any.

You can try the Update By Query API (Elasticsearch Guide [7.15]), which supports the pipeline parameter.
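
To make the difference concrete, here is a toy in-memory model of the two behaviours. It is only a sketch: a plain Python dict stands in for the index, the function names `index_document` and `update_document` are made up for illustration, and nothing here talks to Elasticsearch.

```python
# Toy model only: a dict stands in for the index; no Elasticsearch involved.
# index_document / update_document are hypothetical names for illustration.

def index_document(store, doc_id, body):
    """Mimic the index API: the new body replaces the stored _source."""
    store[doc_id] = dict(body)


def update_document(store, doc_id, changes):
    """Mimic an update: top-level fields in `changes` are merged into _source."""
    merged = dict(store.get(doc_id, {}))
    merged.update(changes)
    store[doc_id] = merged


existing = {"title": ["SHIP PERFORMANCE"], "author": ["HUGHES, C. N."]}
attachment = {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

# Index-style write: the previous fields are gone.
replaced = {"7289": dict(existing)}
index_document(replaced, "7289", attachment)

# Update-style write: the previous fields survive next to the new one.
merged = {"7289": dict(existing)}
update_document(merged, "7289", attachment)

print(sorted(replaced["7289"]))  # ['data']
print(sorted(merged["7289"]))    # ['author', 'data', 'title']
```

The first result matches what the `PUT koha_biblios/data/7289?pipeline=attachment` call above produced: only the new body (plus whatever the pipeline adds) is kept.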

fribeiro-keep · November 17, 2021, 11:09am

@dadoonet thank you for your answer.

I still have two questions: how can I update a specific document (by id), and how can I provide the file to the pipeline using the Update By Query API?

POST koha_biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.attachment.data=e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "7282"
        }
    }
}

This gives me an error: illegal_argument_exception

dadoonet · November 17, 2021, 1:52pm

Could you provide a full recreation script, as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste into the Kibana Dev Console and run to reproduce your use case. It helps readers understand, reproduce, and, if needed, fix your problem. It will also most likely get you a faster answer.

fribeiro-keep · November 17, 2021, 2:55pm

I hope this helps

DELETE biblios
PUT biblios/data/1
{
  "date-of-acquisition": [
    "2021-02-03"
  ],
  "title": [
    "TÉCNICAS DE INDEXAÇÃO EM SISTEMAS DOCUMENTAIS AUTOMATIZADOS, 25-26 FEVEREIRO 1998",
    "OFERTA ANO 1999 (FH)"
  ]
}
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
POST biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.attachment='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "1"
        }
    }
}

For some reason, you have to run the last POST on its own to reproduce the error.

dadoonet · November 17, 2021, 3:50pm

Do this:

DELETE biblios
PUT biblios/_doc/1
{
  "date-of-acquisition": [
    "2021-02-03"
  ],
  "title": [
    "TÉCNICAS DE INDEXAÇÃO EM SISTEMAS DOCUMENTAIS AUTOMATIZADOS, 25-26 FEVEREIRO 1998",
    "OFERTA ANO 1999 (FH)"
  ]
}
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
POST biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.data='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "1"
        }
    }
}
GET biblios/_doc/1

The main change is

"source": "ctx._source.attachment=

to

"source": "ctx._source.data=
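
The script has to write to `data` because the attachment processor reads from the field named in its `field` setting, which is `data` in this pipeline. As a sketch of how the same request body could be built from a file's bytes, the Python below base64-encodes the content and passes it as a script parameter instead of pasting it into the Painless source. The helper name `build_update_by_query` is hypothetical, and passing the value through `params` is my variation, not what was posted above; the body would still be sent to `POST biblios/_update_by_query?pipeline=attachment`.

```python
import base64
import json


def build_update_by_query(doc_id, file_bytes):
    """Build an _update_by_query body that sets `data` on one document.

    Hypothetical helper for illustration. The field name `data` must match
    the `field` setting of the attachment processor in the pipeline.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "script": {
            "lang": "painless",
            # The value comes from params, so no quoting is needed in the source.
            "source": "ctx._source.data = params.data",
            "params": {"data": encoded},
        },
        "query": {"term": {"_id": doc_id}},
    }


# The RTF sample used in this thread, as raw bytes.
rtf_bytes = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

body = build_update_by_query("1", rtf_bytes)
print(json.dumps(body, indent=2))
```

Using `params` avoids the missing-quotes problem from the first attempt and keeps the script source identical across documents.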

fribeiro-keep · November 17, 2021, 4:00pm

That's great! I now see all the previous data plus the ingested document.
Thank you @dadoonet

system · December 15, 2021, 4:01pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
