Elasticsearch-ingest-opennlp pdf


(Ahmed HADDAD) #1

Hello, is it possible to use the "ingest-opennlp" plugin on PDFs?


OpenNLP doesn't return valid results for PDFs.


(Alexander Reelsen) #2

You can use the ingest attachment plugin first, and then run the opennlp processor against the field that was created by the attachment plugin.

--Alex


(Ahmed HADDAD) #3

You mean like this?

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "A pipeline to do named entity extraction",
  "processors": [
    {
      "opennlp" : {
        "field" : "data"
      }
    }
  ]
}

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "data" : "base 64-pdf_conversion "
}

Shouldn't the PDF normally be filtered before OpenNLP ingests it?
If so, please tell me how.


(Shane Connelly) #4

processors is an array. Instead of setting up 2 separate pipelines, set up a single pipeline with the attachment processor first and the opennlp processor next.


(Ahmed HADDAD) #5

Could you show me how, please? I have lost a day on this problem.

Also, the two plugins are different, so how can that work?


(Shane Connelly) #6

Something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "<<base-64 encoded pdf field>>",
        <<any additional ingest-attachment parameters>>
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content",
        <<any additional ingest-opennlp parameters>>
      }
    }
  ]
}

(Ahmed HADDAD) #7

OK, the request is valid, without errors. Thank you! But where would I find the results? In the index?


(Shane Connelly) #8

Yes, you can PUT or POST a document with the ?pipeline=... (as you did). After this, you should be able to do something like GET /indice12/_search and see a result.

You can also test your pipeline with the simulate API, without having to index a document and then _search for or GET it.
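
For illustration, here is a minimal Python sketch of what a simulate request could look like. It only builds and prints the request; nothing is sent to a cluster. The helper name build_simulate_request is hypothetical, and the field name and sample value are placeholders. The pipeline name is the one used in this thread, and the request shape (a docs array of _source objects posted to _simulate) is the one the ingest simulate API expects.

```python
import json


def build_simulate_request(pipeline_id, field_name, encoded_content):
    """Return (path, body) for an ingest simulate call. Hypothetical helper."""
    path = f"/_ingest/pipeline/{pipeline_id}/_simulate"
    body = {
        "docs": [
            # Each simulated document carries its fields under "_source".
            {"_source": {field_name: encoded_content}}
        ]
    }
    return path, body


# "aGVsbG8gdGhlcmU=" is the base-64 encoding of the text "hello there";
# a real test would use the base-64 encoded PDF instead.
path, body = build_simulate_request(
    "opennlp-pipeline", "mycontentfield", "aGVsbG8gdGhlcmU="
)

print("POST " + path)
print(json.dumps(body, indent=2))
```

The printed path and JSON body can be pasted into the console; the response shows each document as it would look after the pipeline has run, without indexing anything.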


(Shane Connelly) #10

"It requires a body. What should I put in the body? I'm new to ES."

The base-64 encoded PDF as a field. And then you'd reference that field name in the field component of the attachment processor.


(Ahmed HADDAD) #11

No, you didn't understand me.
I am talking about this:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{

  here is the problem: what should I put here? It doesn't work.

}

(Shane Connelly) #12

You should put something like "body": "aGVsbG8gdGhlcmU=" or whatever your base-64 encoded pdf is. Assuming you use body here, the field component I have under the attachment section of PUT _ingest/pipeline/opennlp-pipeline would be body
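
A minimal sketch, in Python, of producing that base-64 string from a file. The helper name encode_file_for_attachment is hypothetical, and the temporary file is only a stand-in so the example runs anywhere; the path of a real PDF would go in its place.

```python
import base64
import json
import os
import tempfile


def encode_file_for_attachment(path, field_name="body"):
    """Read a file and return a document whose field holds its base-64 text."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return {field_name: base64.b64encode(raw).decode("ascii")}


# Stand-in for a real PDF: a temporary file containing the bytes "hello there".
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"hello there")
    sample_path = tmp.name

doc = encode_file_for_attachment(sample_path)
os.remove(sample_path)

# The JSON printed here is what goes in the body of the PUT request.
print(json.dumps(doc))
```

With the stand-in file this prints {"body": "aGVsbG8gdGhlcmU="}, matching the example value above.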


(Ahmed HADDAD) #13

I did as you told me:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{

  "body": "pdf-conversion-to-base64"

}

And it returned an error:
java.lang.IllegalArgumentException


(Ahmed HADDAD) #14

Anyway, the first step has succeeded:


(Shane Connelly) #15

The opennlp-pipeline attachment.field should be a field name that you're going to pass in, not the content of the PDF. You've also gotten the order of the opennlp/attachment processors in reverse.
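
To make those two points concrete, here is a rough Python sketch that checks a pipeline definition against a document before sending either to Elasticsearch. The function name find_pipeline_problems and the sample pipelines are hypothetical; it only inspects plain dictionaries and does not talk to a cluster.

```python
def find_pipeline_problems(pipeline, doc):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    processors = pipeline.get("processors", [])
    # Each processor entry is a dict with a single key: the processor type.
    kinds = [next(iter(p)) for p in processors]
    if "attachment" not in kinds or "opennlp" not in kinds:
        problems.append("pipeline needs both an attachment and an opennlp processor")
        return problems
    # The attachment processor has to extract the text before opennlp reads it.
    if kinds.index("attachment") > kinds.index("opennlp"):
        problems.append("attachment must run before opennlp")
    # "field" must name a field of the document, not hold the content itself.
    source_field = processors[kinds.index("attachment")]["attachment"].get("field")
    if source_field not in doc:
        problems.append(f"document has no field named {source_field!r}")
    return problems


good_pipeline = {
    "processors": [
        {"attachment": {"field": "body"}},
        {"opennlp": {"field": "attachment.content"}},
    ]
}
bad_pipeline = {
    "processors": [
        {"opennlp": {"field": "attachment.content"}},
        {"attachment": {"field": "aGVsbG8gdGhlcmU="}},
    ]
}
doc = {"body": "aGVsbG8gdGhlcmU="}

print(find_pipeline_problems(good_pipeline, doc))
print(find_pipeline_problems(bad_pipeline, doc))
```

The first call reports no problems; the second reports both the reversed order and the field that holds content instead of a field name.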


(Ahmed HADDAD) #16

OK for the order,
but above, in your response:

"attachment" : {
  "field" : "<<base-64 encoded pdf field>>",
  <<any additional ingest-attachment parameters>>
}

I'm confused.


(Shane Connelly) #17

So you'd have something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "mycontentfield"
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content"
      }
    }
  ]
}

and then

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "mycontentfield": "aGVsbG8gdGhlcmU="
}

(Ahmed HADDAD) #18

Thank you so much

