Documents (PDF) containing quotes are not parsed or queried correctly


#1

Hello, I've got a little problem. I'm indexing documents (PDF)
in French using the excellent "mapper-attachments" plugin. We are using
Elasticsearch version 1.7.4 and have installed the plugin version that
corresponds to it (at least I hope).

The plugin was installed like this:

sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.0

Everything works fine except for words with quotes. In French, an elided
article is joined to the following word with a quote; for example, "the
attention" is translated as "l'attention". When I index an attachment
containing "l'attention" and then search for "attention", it doesn't match.

When I index a regular string using the same analyzer, it works fine:
"l'attention" matches "attention". The French elision filter works
fine on regular strings but not on attachments.
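
To make the expected behaviour concrete, here is a minimal Python sketch of what an elision step does to a token. It is a simplified illustration, not the code Elasticsearch or Lucene runs: the article list, the tokenizing regular expression, and the function names are all assumptions made for this example, and no stemming is applied.

```python
import re

# Illustrative subset of elided French articles (an assumption, not the plugin's list).
ARTICLES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}
# Straight apostrophe and typographic right single quotation mark.
APOSTROPHES = "'\u2019"


def strip_elision(token):
    """Lowercase the token and drop a leading elided article such as l' or d'."""
    lowered = token.lower()
    for i, ch in enumerate(lowered):
        if ch in APOSTROPHES:
            if lowered[:i] in ARTICLES:
                return lowered[i + 1:]
            break  # first apostrophe does not follow a known article
    return lowered


def analyze(text):
    """Split text into word tokens (keeping inner apostrophes) and strip elisions."""
    tokens = re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*", text)
    return [strip_elision(t) for t in tokens]


if __name__ == "__main__":
    print(analyze("l'attention"))
```

With this sketch, "l'attention" and "attention" end up as the same token, which is why a search for "attention" is expected to match.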

I hope someone will be able to help me.

If needed, I can provide a test case that shows the problem exactly.


(David Pilato) #2

I don't think it's related to mapper plugin but to the analyzer.

Maybe share what you did?


#3

OK. I wrote a simple example to show my problem.
To work properly, the script and the PDF file have to be in the same directory. (It works in any bash shell, even Git Bash on Windows.)

Since I could not upload a PDF or an sh script in the reply, I temporarily pushed it to my website.
You can find the example at this address. (If there is a better way to do it, tell me and I will do it.)

http://www.batipedia.com/pdfQuotesExample.zip

Thanks for your help


(David Pilato) #4

Better to use gist.github.com

See https://www.elastic.co/help/


(David Pilato) #5

I had a quick look. You can simplify your test a lot.

Just:

  • delete index
  • create index with your analyzer
  • run the _analyze API on this index using your analyzer with your text l'attention, and you will see what Elasticsearch actually indexes

See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html for details.

Then, if it's still unclear, create a simple gist with that.
If you are using Sense, there is no need to copy and paste curl commands; just share the Sense script.


#6

Before asking for help, that's exactly what I did. I tried a lot of different combinations, different analyzers, etc. When I execute the _analyze API with l'attention, it returns the token attent, which is exactly what I was expecting.
But, for some reason, it doesn't work on attachments. How can I execute _analyze on an attachment?


#7

Oh, yeah. One more detail. When I index a Word attachment (.doc), it works perfectly (even with my full analyzer, including synonyms), but when I index a PDF (with the same content) it doesn't work. Strange, no?


(David Pilato) #8

Indeed... Weird. I'll try to look at it tomorrow.


#9

Sorry. I think I may have been wrong about the difference between Word and PDF files. I tried to reproduce it this morning and I couldn't. I got the same result with both.
The difference is that, when the text is copied and pasted from the PDF into Word, quotes are changed into apostrophes, some spaces are added, etc. Word modifies the text. This explains why it worked with Word and not with PDFs.
But when the content is strictly the same, I get the same result with both files.
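
For anyone who wants to check whether two files really contain the same quote character, a small Python sketch like the following lists the apostrophe-like characters in a string with their Unicode code points. The set of characters and the function name are assumptions chosen for illustration, and the set is not exhaustive.

```python
import unicodedata

# Characters that commonly stand in for an apostrophe (illustrative, non-exhaustive).
APOSTROPHE_LIKE = {"'", "\u2019", "\u2018", "\u02bc", "`", "\u00b4"}


def find_apostrophes(text):
    """Return (position, code point, Unicode name) for each apostrophe-like character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in APOSTROPHE_LIKE
    ]


if __name__ == "__main__":
    # Straight quote as typed, then the typographic one a word processor may substitute.
    print(find_apostrophes("l'attention"))
    print(find_apostrophes("l\u2019attention"))
```

Running it on text extracted from each file shows whether one contains U+0027 and the other U+2019.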


(David Pilato) #10

Can you change your mapping from:

"pdfFile": {
   "type": "attachment", 
   "analyzer": "my_analyzer"
}

To:

"pdfFile": {
   "type": "attachment", 
   "fields" : {
      "content" : {"type" : "string", "analyzer": "my_analyzer"}
   }
}

And see what happens then?


#11

Yes! You did it... It works. Thank you.
I just had to make one small adaptation to your code. Instead of content, I had to use the same name as the attachment field (pdfFile):

"pdfFile": {
   "type": "attachment",
   "fields" : {
      "pdfFile" : {"type" : "string", "analyzer": "my_analyzer"}
   }
}

Thanks a lot for your help. I would never have thought of that myself. Everything works fine now.
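
For reference, here is a minimal Python sketch that assembles the working mapping and a document body for it. The field name `pdfFile` and the analyzer name `my_analyzer` come from this thread; the type name `document`, the helper names, and the placeholder bytes are hypothetical. The attachment field takes the file content as a base64-encoded string.

```python
import base64
import json

FIELD = "pdfFile"         # attachment field name used in this thread
ANALYZER = "my_analyzer"  # analyzer name used in this thread


def attachment_mapping(doc_type):
    """Mapping where the content sub-field (same name as the field) uses the analyzer."""
    return {
        doc_type: {
            "properties": {
                FIELD: {
                    "type": "attachment",
                    "fields": {
                        FIELD: {"type": "string", "analyzer": ANALYZER},
                    },
                }
            }
        }
    }


def attachment_document(raw_bytes):
    """Document body carrying the file content as a base64 string."""
    return {FIELD: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    # "document" is a hypothetical type name; the bytes stand in for a real PDF file.
    print(json.dumps(attachment_mapping("document"), indent=2))
    print(json.dumps(attachment_document(b"placeholder bytes, not a real PDF")))
```

In practice the bytes would be read from the PDF file on disk, and the two JSON bodies sent to the mapping and index endpoints of the cluster.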


(David Pilato) #12

Which Elasticsearch version are you using?

And by the way, I think it would be good to add this to the documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html.

Contributions are warmly welcomed! :)


#13

I'm using Elasticsearch version 1.7.4.
I'll try to update the Elasticsearch documentation, and hopefully I'll be able to explain this properly.
Do you think I should explain the problem in the documentation, or should I propose mapping the content field as a best practice?
Your documentation is very well done and I don't want to make a mistake. Will my modification go online directly, or will it be reviewed by someone first?


(David Pilato) #14

We will review it. No worries.

