Elasticsearch-ingest-opennlp pdf

Hello, is it possible to use the "ingest-opennlp" plugin on PDF files?

You can use the ingest attachment plugin first, and then run the opennlp processor against the field that was created by the attachment plugin.

--Alex

You mean like this?

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "A pipeline to do named entity extraction",
  "processors": [
    {
      "opennlp" : {
        "field" : "data"
      }
    }
  ]
}

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "data" : "base 64-pdf_conversion "
}

Doesn't OpenNLP normally need the PDF to be processed before it is ingested?
Please tell me how.

processors is an array. Instead of setting up 2 separate pipelines, set up a single pipeline with the attachment processor first and the opennlp processor next.

Could you show me how, please? I have lost a day on this problem.

Also, the two plugins are different, so how can that work?

Something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "<<base-64 encoded pdf field>>",
        <<any additional ingest-attachment parameters>>
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content",
        <<any additional ingest-opennlp parameters>>
      }
    }
  ]
}

OK, the request is valid and returns no errors, thank you! But where would I find the results? In the index?

Yes, you can PUT or POST a document with ?pipeline=... (as you did). After this, you should be able to do something like GET /indice12/_search and see a result.

You can also test your pipeline with the simulate API, without having to index a document and then _search for it or GET it.
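For reference, a rough sketch of what a simulate request body could look like, built in Python so the structure is easy to see. The simulate API takes a "docs" array where each entry wraps the document under "_source". The field name "data" and the sample base-64 value are only assumptions for illustration; use whatever field your attachment processor reads.

```python
import json


def build_simulate_body(source_docs):
    """Wrap each source document the way the simulate API expects."""
    return {"docs": [{"_source": doc} for doc in source_docs]}


# "data" is an assumed field name; the value is a tiny base-64 sample,
# not a real PDF.
body = build_simulate_body([{"data": "aGVsbG8gdGhlcmU="}])

# Send the printed JSON as the body of:
#   POST _ingest/pipeline/opennlp-pipeline/_simulate
print(json.dumps(body, indent=2))
```

The response should show each document as it would look after the pipeline ran, so you can check the extracted fields before indexing anything.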

It requires a body. What should I put in the body? I'm new to ES.

The base-64 encoded PDF as a field. Then you'd reference that field name in the field component of the attachment processor.

No, you didn't understand me.
I'm talking about this:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{

  This is the problem: what should I put here? It doesn't work.

}

You should put something like "body": "aGVsbG8gdGhlcmU=" or whatever your base-64 encoded PDF is. Assuming you use body here, the field component I have under the attachment section of PUT _ingest/pipeline/opennlp-pipeline would be body.
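In case it helps, here is a minimal sketch of how the base-64 value could be produced with Python's standard library. The helper name build_document and the file path in the comment are hypothetical; the field name "body" just follows the example above.

```python
import base64
import json


def build_document(pdf_bytes: bytes, field: str = "body") -> dict:
    """Return an index-request body holding the base-64 encoded bytes."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {field: encoded}


# In practice the bytes would come from a real file, for example:
#     with open("report.pdf", "rb") as f:   # hypothetical path
#         pdf_bytes = f.read()
# A short byte string stands in for the PDF here.
document = build_document(b"hello there")
print(json.dumps(document))  # prints {"body": "aGVsbG8gdGhlcmU="}
```

The printed JSON is what goes between the braces of the PUT request. The value has to be the encoded file content itself, not a description of it.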

I did as you told me:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "body": "pdf-conversion-to-base64"
}

And it returned an error:
java.lang.IllegalArgumentException

Anyway, the first step has succeeded:

The opennlp-pipeline attachment.field should be a field name that you're going to pass in, not the content of the PDF. You've also gotten the order of the opennlp/attachment processors in reverse.

OK for the order,
but above, in your response:

"attachment" : {
    "field" : "<<base-64 encoded pdf field>>",
    <<any additional ingest-attachment parameters>>
  }

I'm confused.

So you'd have something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "mycontentfield"
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content"
      }
    }
  ]
}

and then

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "mycontentfield": "aGVsbG8gdGhlcmU="
}

Thank you so much
