Using ingest pipeline attachment processor with inference processor

mikhatanu · July 9, 2024, 3:58am

Hello, I have data that I would like to process through the attachment processor, then use the ELSER inference processor to do semantic search.

I created my index with these mappings:

PUT pdf
{
  "mappings": {
    "properties": {
      "nlp_enrichment": { 
        "type": "sparse_vector" 
      },
      "attachment": { 
        "type": "object" 
      }
    }
  }
}

Then I created the ingest pipeline:

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
      },
      "inference": {
        "model_id": ".elser_model_2_linux-x86_64",
        "input_output": [ 
          {
            "input_field": "attachment",
            "output_field": "nlp_enrichment"
          }
        ]
      }
    }
  ]
}

Then I index a document:

POST pdf/_doc/1?pipeline=elser-v2-test
{
  "data":"<base64 encoded>"
}

But it returns an error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "input field [attachment] cannot be processed because it is not a text field"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "input field [attachment] cannot be processed because it is not a text field"
  },
  "status": 400
}

Shouldn't the attachment processor create a field of object type, so that the inference processor can read the value of attachment? Or is the ELSER model only able to use text fields?

dadoonet · July 9, 2024, 6:27am

Try with:

"input_field": "attachment.content",

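For reference, a minimal sketch of the pipeline from the first post with that change applied. It is written as a Python dict only so the body can be built and printed as JSON for the Dev Tools console; the variable name pipeline is illustrative, and putting each processor in its own object follows the usual documented layout rather than anything stated in this thread.

```python
import json

# Pipeline body from the original post, with the inference input pointed at
# the extracted text (attachment.content) instead of the whole attachment object.
pipeline = {
    "processors": [
        {
            "attachment": {
                "field": "data",
                "remove_binary": True,
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2_linux-x86_64",
                "input_output": [
                    {
                        "input_field": "attachment.content",
                        "output_field": "nlp_enrichment",
                    }
                ],
            }
        },
    ]
}

# Print the JSON body to paste after: PUT _ingest/pipeline/elser-v2-test
print(json.dumps(pipeline, indent=2))
```
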
mikhatanu · July 9, 2024, 8:26am

Using attachment.content works, but is it possible to run inference on all fields in an object? I tried using the foreach processor:

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
      },
      "foreach": {
        "field": "attachment",
        "processor": {
          "inference": {
            "model_id": ".elser_model_2_linux-x86_64",
            "input_output": [ 
              {
                "input_field": "_ingest._key",
                "output_field": "nlp_enrichment._ingest._key"
              }
            ]
          }
        }
      }
    }
  ]
}

But it returns an object array. I've read in the inference processor documentation that you can specify multiple input_output entries:

"input_output": [
  {
    "input_field": "attachment.content",
    "output_field": "nlp_enrichment.content"
  },
  {
    "input_field": "attachment.content_type",
    "output_field": "nlp_enrichment.content_type"
  }
]

How can I do this so that input_field uses each key of the attachment object?

dadoonet · July 9, 2024, 1:13pm

I don't understand why you'd do such a thing. IMO only a small set of the generated fields is eligible to run ELSER on.

I don't see why you would want to run ELSER on fields like:

  "content_type": "application/rtf",
  "language": "ro",
  "content_length": 28

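If ELSER is wanted on more than one extracted field, one option is to list the text-bearing fields explicitly and generate the input_output entries from that list. This is a minimal sketch, not a confirmed recipe: the helper name build_input_output is hypothetical, and title is an assumed second field (the attachment processor can extract it when the document carries one, but it does not appear in this thread).

```python
import json


def build_input_output(fields, source="attachment", target="nlp_enrichment"):
    """Return one input_output entry per field name (hypothetical helper)."""
    return [
        {
            "input_field": f"{source}.{name}",
            "output_field": f"{target}.{name}",
        }
        for name in fields
    ]


# Assumption: only free-text fields are worth embedding; metadata such as
# content_type, language, and content_length is deliberately left out.
text_fields = ["content", "title"]

inference_processor = {
    "inference": {
        "model_id": ".elser_model_2_linux-x86_64",
        "input_output": build_input_output(text_fields),
    }
}

# Print the processor definition to paste into the pipeline's "processors" array.
print(json.dumps(inference_processor, indent=2))
```

Since the index in the first post maps nlp_enrichment itself as sparse_vector, writing to nlp_enrichment.content and nlp_enrichment.title would presumably require mapping each of those subfields as sparse_vector instead. Documents that lack one of the listed fields may also need extra handling.
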