Using ingest pipeline attachment processor with inference processor

Hello, I have some data that I would like to process through the attachment processor and then run through the ELSER inference processor so I can do semantic search.

I created my index with these mappings:

PUT pdf
{
  "mappings": {
    "properties": {
      "nlp_enrichment": { 
        "type": "sparse_vector" 
      },
      "attachment": { 
        "type": "object" 
      }
    }
  }
}

Then I created the ingest pipeline:

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
    },
      "inference": {
        "model_id": ".elser_model_2_linux-x86_64",
        "input_output": [ 
          {
            "input_field": "attachment",
            "output_field": "nlp_enrichment"
          }
        ]
      }
    }
  ]
}

Then I index a document:

POST pdf/_doc/1?pipeline=elser-v2-test
{
  "data":"<base64 encoded>"
}

But it returns this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "input field [attachment] cannot be processed because it is not a text field"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "input field [attachment] cannot be processed because it is not a text field"
  },
  "status": 400
}

Shouldn't the attachment processor create the attachment field as an object, so the inference processor can read the value of attachment? Or is the ELSER model only able to use text fields?
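
For reference, after the attachment processor runs, the indexed document looks roughly like this (the field names come from the attachment processor; the values are just illustrative, and the data field is gone because of remove_binary):

{
  "attachment": {
    "content": "the extracted text of the PDF ...",
    "content_type": "application/pdf",
    "language": "en",
    "content_length": 123
  }
}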

Try with:

"input_field": "attachment.content",
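
In other words, the same pipeline as above with only the input_field changed (nothing else is new here):

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
      }
    },
    {
      "inference": {
        "model_id": ".elser_model_2_linux-x86_64",
        "input_output": [
          {
            "input_field": "attachment.content",
            "output_field": "nlp_enrichment"
          }
        ]
      }
    }
  ]
}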

Using attachment.content works, but is it possible to run inference on every field in an object? I tried using a foreach processor:

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
    },
      "foreach": {
        "field": "attachment",
        "processor": {
          "inference": {
            "model_id": ".elser_model_2_linux-x86_64",
            "input_output": [ 
              {
                "input_field": "_ingest._key",
                "output_field": "nlp_enrichment._ingest._key"
              }
            ]
          }
        }
      }
    }
  ]
}

But it returns an array of objects. I've read in the inference processor documentation that you can specify multiple input/output pairs:

 "input_output": [ 
              {
                "input_field": "attachment.content",
                "output_field": "nlp_enrichment.content"
              },
              {
                "input_field": "attachment.content_type",
                "output_field": "nlp_enrichment.content_type"
              },
]

How can I do this so that the input_field uses each key of the attachment object?

I don't understand why you'd do such a thing. IMO only a small set of the generated fields are eligible to run ELSER on.

I don't see why you would want to run ELSER on fields like:

  "content_type": "application/rtf",
  "language": "ro",
  "content_length": 28