Elasticsearch-ingest-opennlp pdf

akml_kk · March 7, 2018, 4:15pm

Hello, is it possible to use the "ingest-opennlp" plugin on PDFs?

spinscale · March 7, 2018, 4:30pm

You can use the ingest attachment plugin first, and then run the opennlp processor against the field that was created by the attachment plugin.

--Alex

akml_kk · March 8, 2018, 7:41am

You mean like this?

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "A pipeline to do named entity extraction",
  "processors": [
    {
      "opennlp" : {
        "field" : "data"
      }
    }
  ]
}

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "data" : "base 64-pdf_conversion "
}

Doesn't the PDF normally need to be filtered before OpenNLP can process it?
If so, please tell me how.

shanec · March 8, 2018, 3:48pm

processors is an array. Instead of setting up 2 separate pipelines, set up a single pipeline with the attachment processor first and the opennlp processor next.

akml_kk · March 8, 2018, 3:50pm

Could you show me how, please? I have lost a day on this problem.

Also, the two plugins are different, so how can that work?

shanec · March 8, 2018, 4:04pm

Something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "<<base-64 encoded pdf field>>",
        <<any additional ingest-attachment parameters>>
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content",
        <<any additional ingest-opennlp parameters>>
      }
    }
  ]
}

akml_kk · March 8, 2018, 4:22pm

OK, the request is valid, without errors. Thank you! But where would I find the results? In the index?

shanec · March 8, 2018, 4:28pm

Yes, you can PUT or POST a document with ?pipeline=... (as you did). After this, you should be able to do something like GET /indice12/_search and see a result.

You can also test your pipeline with the simulate API, without having to index a document and then _search for or GET it.
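
For reference, a simulate request wraps each test document in a docs array, with the document fields under _source. Here is a minimal Python sketch that builds such a body. It assumes the field is called data (as in the earlier attempt) and uses a short plain-text sample in place of a real PDF; the helper name build_simulate_body is made up for illustration.

```python
import base64
import json


def build_simulate_body(field_name: str, raw_bytes: bytes) -> dict:
    """Build a JSON body for POST _ingest/pipeline/<name>/_simulate."""
    # The attachment processor expects the file content as base-64 text.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    # Each simulated document goes under "_source" inside the "docs" array.
    return {"docs": [{"_source": {field_name: encoded}}]}


# Plain-text stand-in for the bytes of a real PDF (assumption for the demo).
body = build_simulate_body("data", b"hello there")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of POST _ingest/pipeline/opennlp-pipeline/_simulate. The response shows each document as it would look after the processors have run, without anything being indexed.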

shanec · March 8, 2018, 5:35pm

"It requires a body. What should I put in the body? I'm new to ES."

The base-64 encoded PDF as a field. And then you'd reference that field name in the field component of the attachment processor

akml_kk · March 8, 2018, 6:38pm

No, you didn't understand me.
I am talking about this:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  here is the problem: what should I put here? It doesn't work.
}

shanec · March 8, 2018, 6:42pm

You should put something like "body": "aGVsbG8gdGhlcmU=" or whatever your base-64 encoded PDF is. Assuming you use body as the field name here, the field setting I have under the attachment processor in PUT _ingest/pipeline/opennlp-pipeline would be body.
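
In case it helps, the sample value above is simply the base-64 encoding of the text "hello there". Below is a rough Python sketch of producing such a value from a file. It writes a small temporary file as a stand-in for a real PDF, and the helper name encode_file is hypothetical; the field name body follows the suggestion above.

```python
import base64
import json
import tempfile
from pathlib import Path


def encode_file(path: Path) -> str:
    """Return the file's bytes as a base-64 ASCII string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


# Stand-in for a real PDF: a temporary file holding the text "hello there".
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello there")
    # The document to index: one field holding the encoded file content.
    document = {"body": encode_file(sample)}

print(json.dumps(document))  # prints {"body": "aGVsbG8gdGhlcmU="}
```

For a real PDF you would pass the path of that file to encode_file instead, and the resulting string would be much longer.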

akml_kk · March 8, 2018, 7:08pm

I did as you told me:

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "body" : "pdf-conversion-to-base64"
}

And it returned an error:
java.lang.IllegalArgumentException

akml_kk · March 8, 2018, 7:36pm

Anyway, the first step has succeeded:

shanec · March 8, 2018, 7:40pm

The opennlp-pipeline attachment.field should be a field name that you're going to pass in, not the content of the PDF. You've also gotten the order of the opennlp/attachment processors in reverse.
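
Both mistakes can be caught before sending anything to the cluster. The following Python sketch checks a pipeline definition against a sample document. The function find_pipeline_problems is a hypothetical helper, not part of Elasticsearch or either plugin, and the "broken" pipeline is only an illustrative reconstruction of the two mistakes described here.

```python
from typing import List


def find_pipeline_problems(pipeline: dict, document: dict) -> List[str]:
    """Report the two mistakes discussed above, if present."""
    problems = []
    processors = pipeline.get("processors", [])
    # Each processor is a single-key object; the key is the processor type.
    names = [next(iter(processor)) for processor in processors]
    if "attachment" in names and "opennlp" in names:
        if names.index("attachment") > names.index("opennlp"):
            problems.append("attachment must run before opennlp")
    for processor in processors:
        config = processor.get("attachment")
        if config is None:
            continue
        field = config.get("field")
        # "field" must name a field of the document, not hold file content.
        if field not in document:
            problems.append(f"attachment field {field!r} is not in the document")
    return problems


document = {"body": "aGVsbG8gdGhlcmU="}
broken = {"processors": [
    {"opennlp": {"field": "attachment.content"}},
    {"attachment": {"field": "aGVsbG8gdGhlcmU="}},
]}
fixed = {"processors": [
    {"attachment": {"field": "body"}},
    {"opennlp": {"field": "attachment.content"}},
]}
print(find_pipeline_problems(broken, document))
print(find_pipeline_problems(fixed, document))
```

The first call reports both problems; the second returns an empty list.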

akml_kk · March 8, 2018, 7:52pm

OK for the order,
but above, in your response:

"attachment" : {
    "field" : "<<base-64 encoded pdf field>>",
    <<any additional ingest-attachment parameters>>
  }

I'm confused.

shanec · March 8, 2018, 8:00pm

So you'd have something like

PUT _ingest/pipeline/opennlp-pipeline
{
 "description": "A pipeline to do named entity extraction",
 "processors": [
    {
      "attachment" : {
        "field" : "mycontentfield"
      }
    },
    {
      "opennlp" : {
        "field" : "attachment.content"
      }
    }
  ]
}

and then

PUT /indice12/type/1?pipeline=opennlp-pipeline
{
  "mycontentfield": "aGVsbG8gdGhlcmU="
}
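
If you would rather issue the same two requests from a script than from the console, a sketch along these lines might work. It assumes an unsecured cluster at http://localhost:9200 (an assumption, adjust to your setup) and only builds the requests; the sending part is commented out because it needs a running cluster with both plugins installed. The helper name json_put is made up.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def json_put(path: str, body: dict) -> urllib.request.Request:
    """Build a PUT request carrying a JSON body."""
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Single pipeline: attachment extracts the text, then opennlp reads it.
pipeline_request = json_put("/_ingest/pipeline/opennlp-pipeline", {
    "description": "A pipeline to do named entity extraction",
    "processors": [
        {"attachment": {"field": "mycontentfield"}},
        {"opennlp": {"field": "attachment.content"}},
    ],
})

# The document carries the base-64 content under the field named above.
index_request = json_put(
    "/indice12/type/1?pipeline=opennlp-pipeline",
    {"mycontentfield": "aGVsbG8gdGhlcmU="},
)

# Uncomment to send against a running cluster:
# for request in (pipeline_request, index_request):
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```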

akml_kk · March 8, 2018, 8:07pm

Thank you so much

system · April 5, 2018, 8:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
