Ingest Attachment Plugin update index

Hey there,

I'm trying to use the Ingest Attachment plugin on Elasticsearch 6.8.

My goal is to append the document text to an existing index.

This is a sample document from my current index:

{
  "_index": "koha_biblios",
  "_type": "data",
  "_id": "7289",
  "_version": 1,
  "_seq_no": 1390,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": [
      "SHIP PERFORMANCE"
    ],
    "author": [
      "HUGHES, C. N."
    ],
    "itype": [
      "MON"
    ]
  }
}

First, I add a new mapping to the existing index:

PUT koha_biblios/_mapping/data
{
    "properties" : {
        "attachment.data" : {
            "type": "text",
            "analyzer" : "analyzer_standard"
        }
    }
}

Then I create the pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

And then, when I submit the file for ingestion:

PUT koha_biblios/data/7289?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

I lose all the existing information and get only the document data:

{
  "_index": "koha_biblios",
  "_type": "data",
  "_id": "7289",
  "_version": 3,
  "_seq_no": 11142,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

Is it possible to append the ingested document's fields instead of replacing the existing document?
What am I missing?

Best Regards,
Filipe

This behavior is expected: you are calling the index API, which indexes a new version of the document, overwriting the previous one if any.

You can try the Update By Query API, which supports the pipeline parameter.
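A sketch of what that could look like (an assumption on my side, reusing the attachment pipeline and document id from above; note that the plain _update API does not accept a pipeline parameter, which is why _update_by_query is needed here). This only works if the document already contains the base64 data field the pipeline reads:

POST koha_biblios/_update_by_query?pipeline=attachment
{
  "query": {
    "ids": { "values": ["7289"] }
  }
}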

@dadoonet thank you for your answer.

I still have two questions: how can I update a specific document (by id), and how can I provide the file to the pipeline when using the Update By Query API?

POST koha_biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.attachment.data=e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "7282"
        }
    }
}

This gives me an error: illegal_argument_exception.

Could you provide a full recreation script, as described in the "About the Elasticsearch category" topic? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste into the Kibana dev console and run to reproduce your use case. It helps readers understand, reproduce, and if needed fix your problem. It will also most likely get you a faster answer.

I hope this helps

DELETE biblios
PUT biblios/data/1
{
  "date-of-acquisition": [
    "2021-02-03"
  ],
  "title": [
    "TÉCNICAS DE INDEXAÇÃO EM SISTEMAS DOCUMENTAIS AUTOMATIZADOS, 25-26 FEVEREIRO 1998",
    "OFERTA ANO 1999 (FH)"
  ]
}
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
POST biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.attachment='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "1"
        }
    }
}

For some reason, you have to run the last POST individually in order to reproduce the error.
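A possible explanation (my assumption, not confirmed in the thread): a newly indexed document only becomes visible to search after a refresh, and _update_by_query selects documents via search, so when all the requests are pasted and run in one go the query may not match anything yet. Forcing a refresh before the _update_by_query should make the whole script reproducible in a single run:

POST biblios/_refresh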

Do this:

DELETE biblios
PUT biblios/_doc/1
{
  "date-of-acquisition": [
    "2021-02-03"
  ],
  "title": [
    "TÉCNICAS DE INDEXAÇÃO EM SISTEMAS DOCUMENTAIS AUTOMATIZADOS, 25-26 FEVEREIRO 1998",
    "OFERTA ANO 1999 (FH)"
  ]
}
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
POST biblios/_update_by_query?pipeline=attachment
{
    "script": {
        "source": "ctx._source.data='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='",
        "lang": "painless"
    },
    "query": {
        "term": {
            "_id": "1"
        }
    }
}
GET biblios/_doc/1
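For reference, with the RTF sample used earlier in the thread, the final GET should return the original fields plus the data field and the extracted attachment object, roughly:

{
  "_index": "biblios",
  "_type": "_doc",
  "_id": "1",
  "found": true,
  "_source": {
    "date-of-acquisition": ["2021-02-03"],
    "title": [
      "TÉCNICAS DE INDEXAÇÃO EM SISTEMAS DOCUMENTAIS AUTOMATIZADOS, 25-26 FEVEREIRO 1998",
      "OFERTA ANO 1999 (FH)"
    ],
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}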

The main change is

"source": "ctx._source.attachment=

to

"source": "ctx._source.data=

The attachment processor reads its input from the data field ("field" : "data" in the pipeline definition), so the script has to put the base64 content there; the processor then writes the extracted text into the attachment object itself.

That's great! I now see all the previous data plus the ingested document.
Thank you @dadoonet
