OpenNLP doesn't return valid results for a PDF

Hello, thanks to this site I recently succeeded in linking OpenNLP to my PDF.
However, the results weren't as good as I expected.

This is the PDF:

And this is what I get:

What is wrong with the plugin?
Is there something I need to fix before using it?

Images are hard to read. It would be better to paste text only, especially when it's not a UI problem.

Anyway, what looks wrong here?

Not sure I understand the "problem". Looks like your document has entities for persons, locations, dates.

So what is wrong? What do you expect?

I have many names in my PDF, but OpenNLP didn't return them.
I also have many addresses, and it didn't return all of them.

The only things I got right are the dates.
Does it need some configuration?

Can you reproduce it with just the NLP plugin, providing only text as the input?

If you can, it may be worth opening an issue in the NLP project.

PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "A pipeline to do named entity extraction",
  "processors": [
    {
      "attachment": {
        "field": "mycontentfield"
      }
    },
    {
      "opennlp": {
        "field": "attachment.content"
      }
    }
  ]
}

PUT /indice14/type/1?pipeline=opennlp-pipeline
{
  "mycontentfield": "base64of the pdf"
}

This was the response I got here to a previous question.

Where can I post an issue for the NLP project? Here, or somewhere else?
I can't find a category named "NLP project" here!
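
For the "base64of the pdf" placeholder in the request above, here is a minimal sketch of how the body could be produced, assuming the PDF bytes are already in memory. The function name `build_attachment_doc` and the file name `cv.pdf` are made up for illustration; only the field name `mycontentfield` comes from the pipeline above.

```python
import base64
import json


def build_attachment_doc(pdf_bytes: bytes, field: str = "mycontentfield") -> str:
    """Return a JSON body whose single field holds the base64-encoded PDF."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({field: encoded})


# Stand-in bytes so the sketch runs on its own. In practice they would come
# from something like open("cv.pdf", "rb").read() (hypothetical file name).
sample_pdf_bytes = b"%PDF-1.4 example bytes"
body = build_attachment_doc(sample_pdf_bytes)
print(body)
```

The printed JSON is what would go in the body of the `PUT /indice14/type/1?pipeline=opennlp-pipeline` request.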

I said: without the attachment plugin. Could you reproduce and share your script?

No, with just the NLP plugin I can't.
Actually, I asked here, and the owner of the plugin, "Alexander Reelsen", responded that I need to use the attachment plugin first.

No. He told you that if you want to extract text from a PDF document you need to use ingest-attachment.
If you want to send the text to NLP then you need to use the NLP plugin.

What I'm asking for is to use only NLP and send to it just some text.
And reproduce the issue you are seeing.

You can also use the _simulate ingest endpoint with the verbose option to see what happens at each step: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/simulate-pipeline-api.html

Then share the output here, along with a full reproduction script, if you need help.
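
As a rough sketch of what such a simulate call could look like with only the NLP processor and plain text, the snippet below builds the request line and body. The field name `my_field`, the description string, and the helper name `build_simulate_request` are placeholders, not something the plugin requires; the `opennlp` processor with a `field` option is taken from the pipeline earlier in this thread.

```python
import json
from typing import Tuple


def build_simulate_request(text: str, field: str = "my_field") -> Tuple[str, str]:
    """Return (request line, JSON body) for a verbose simulate call.

    The inline pipeline contains only the opennlp processor, so no
    attachment extraction is involved.
    """
    body = {
        "pipeline": {
            "description": "NLP only, no attachment processor",
            "processors": [{"opennlp": {"field": field}}],
        },
        "docs": [{"_source": {field: text}}],
    }
    request_line = "POST _ingest/pipeline/_simulate?verbose"
    return request_line, json.dumps(body, ensure_ascii=False, indent=2)


request_line, request_body = build_simulate_request("Some plain text to analyze")
print(request_line)
print(request_body)
```

The two printed parts can be pasted into Dev Tools as a single request; with `verbose`, the response lists the document as it looks after each processor.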

Yes, today I sent just text to the OpenNLP plugin, as you told me, and the result was the same!

This is what I wrote in Dev Tools:

PUT /index20/type/1?pipeline=opennlp-pipeline
{
  "my_field": "Ahmed HADDAD Développement et\nAvenue de l’UMA\n\n\nConception en\n2035 Charguia 2\n\njava/JEE\n\nTél :25932722\n\nhaddadahmed1994@gmail.com\n23 ans\nFORMATION\n2017-2018 3éme année Génie Informatique\n2014-2015Diplôme Ingénieur premier cycle\n2011-2012 Diplôme BAC Science\nLangues\n :  Anglais-Français et de L’italien\n\nEXPERIENCES\n\nAôut 2017 : Stage en Advance Web djerba—réalisation d’un tchat en AJAX et JQuery\n\n\nJuilliet 2016 : Stage d’Intiation en Advance Web djerba-un Blog communautaire php et Mysql\n\n\nLOISIRS\nJ’aime la Consultation des News Lettre des Communauté  informatique dans le web,\nJouer les Jeux videos Steam online , un peu d’echec."
}

Using simulate is easier.
Anyway, what is the output of

GET index20/type/1

?

This is the response:

{
  "_index": "index20",
  "_type": "type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "my_field": """
 Ahmed HADDAD Développement et
Avenue de l’UMA


Conception en
2035 Charguia 2

java/JEE

Tél :25932722

haddadahmed1994@gmail.com
23 ans
FORMATION
2017-2018 3éme année Génie Informatique
2014-2015Diplôme Ingénieur premier cycle
2011-2012 Diplôme BAC Science
Langues
:  Anglais-Français et de L’italien

EXPERIENCES

Aôut 2017 : Stage en Advance Web djerba—réalisation d’un tchat en AJAX et JQuery


Juilliet 2016 : Stage d’Intiation en Advance Web djerba-un Blog communautaire php et Mysql


 LOISIRS
J’aime la Consultation des News Lettre des Communauté  informatique dans le web,
Jouer les Jeux videos Steam online , un peu d’echec.
""",
    "entities": {
      "persons": [
        "Mysql LOISIRS J",
        "Avenue"
      ],
      "dates": [
        "2011 - 2012",
        "1994",
        "2016",
        "2014 - 2015",
        "2035"
      ],
      "locations": [
        "Avenue de l"
      ]
    }
  }
}

What do you expect?

I expect:
Persons: Ahmed HADDAD
Dates: the same, but without the "-"
Locations: Avenue de l’UMA, 2035 Charguia 2, djerba

Also:

  1. I see that the owner of the plugin has created categories like "persons", "dates", and "locations".
  2. What regexes or methods did he use to parse the file and extract this information? Can we change them?
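
To make the gap between the expected and returned entities easier to see, here is a small sketch that diffs the two sets per category. The values are copied from the response and the expectations in this thread; dates are left out because "the same but without '-'" does not say exactly which strings are expected. The helper name `compare` is only for illustration.

```python
# Entities the poster expects, as listed in the thread.
expected = {
    "persons": ["Ahmed HADDAD"],
    "locations": ["Avenue de l’UMA", "2035 Charguia 2", "djerba"],
}

# Entities actually returned in the "entities" field of the response.
returned = {
    "persons": ["Mysql LOISIRS J", "Avenue"],
    "locations": ["Avenue de l"],
}


def compare(expected, returned):
    """For each category, list expected values that are missing and extra values returned."""
    report = {}
    for kind in expected:
        want = set(expected[kind])
        got = set(returned.get(kind, []))
        report[kind] = {
            "missing": sorted(want - got),
            "unexpected": sorted(got - want),
        }
    return report


report = compare(expected, returned)
for kind, diff in report.items():
    print(kind, diff)
```

None of the expected persons or locations match exactly, which is what the comparison reports.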

I'm not familiar with this plugin, but I wonder if the problem is the language used.

Your text here is in French. Maybe you are using the default models, which are for English (if I understand this part of the docs correctly):

ingest.opennlp.model.file.persons: en-ner-persons.bin
ingest.opennlp.model.file.dates: en-ner-dates.bin
ingest.opennlp.model.file.locations: en-ner-locations.bin

Maybe @spinscale has more ideas?

Yes, it's French, and an Arabic name written in French.

The settings you posted are what I added to be able to use the plugin.

Do you mean that it only works with English names, dates, and locations?

I don't know. That's just a guess; when @spinscale is available, he will be able to answer.

This depends on the model. The default models only work with English, but there may be others that work with your language of choice. You need to check the Apache OpenNLP project for that.
