Documents (PDF) containing quotes are not parsed or queried correctly

rodrigue · February 11, 2016, 1:29pm

Hello, I've got a small problem. I'm indexing documents (PDF)
in French using the excellent "mapper-attachments" plugin. We are using
Elasticsearch version 1.7.4 and have installed the plugin version that
corresponds to it (at least I hope so).

The plugin was installed like this:

sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.0

Everything works fine except for words with quotes. In French, some words
are joined with a quote (apostrophe); for example, "the attention" is
written "l'attention". When I index an attachment containing the word
"l'attention" and then search for "attention", it doesn't match.

When I index a regular string using the same analyzer, it works fine:
"l'attention" matches "attention". The French elision filter works
on regular strings but not on attachments.
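
To make the expected behaviour concrete, here is a minimal Python sketch of what an elision step does to a single token. It is a simplified imitation written for this post, not the Lucene filter itself; the `elide` function and the `ARTICLES` list are assumptions chosen for illustration.

```python
# Simplified imitation of an elision step (not Lucene's implementation).
# ARTICLES is an illustrative subset of French elided forms, chosen for this sketch.
ARTICLES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}

# Treat both the ASCII quote and the typographic apostrophe (U+2019) as separators.
APOSTROPHES = ("'", "\u2019")


def elide(token: str) -> str:
    """Return the token without a leading elided article, if it has one."""
    for index, char in enumerate(token):
        if char in APOSTROPHES:
            prefix = token[:index].lower()
            if prefix in ARTICLES:
                return token[index + 1:]
            break  # only the first apostrophe is considered
    return token


if __name__ == "__main__":
    for word in ["l'attention", "l\u2019attention", "attention", "aujourd'hui"]:
        print(word, "->", elide(word))
```

If the text extracted from the attachment went through a step like this, a search for "attention" would be expected to match a document containing "l'attention".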

I hope someone will be able to help me.

If needed I can provide you a testcase that shows exactly the problem.

dadoonet · February 11, 2016, 3:30pm

I don't think it's related to the mapper plugin but to the analyzer.

Maybe share what you did?

rodrigue · February 11, 2016, 5:28pm

OK. I wrote a simple example to show my problem.
For it to work properly, the script and the PDF file have to be in the same directory. (It works in any bash shell, even Git Bash on Windows.)

Since I could not upload a PDF or a shell script in the reply, I temporarily put it on my website.
You can find the example at this address (if there is a better way to do it, tell me and I will do it):

http://www.batipedia.com/pdfQuotesExample.zip

Thanks for your help

dadoonet · February 11, 2016, 5:40pm

Better to use gist.github.com

See https://www.elastic.co/help/

dadoonet · February 11, 2016, 6:15pm

I had a quick look. You can simplify your test a lot.

Just:

delete the index
create the index with your analyzer
run the _analyze API on this index using your analyzer with your text l'attention, and you will see what Elasticsearch actually indexes

See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html for details.
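
The three steps above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the node address, the index name `test_elision`, and the analyzer definition in `SETTINGS` are placeholders made up for illustration, not the settings from the test case in this thread. As far as I know, the 1.x line accepts the text to analyze as the raw request body with the analyzer given in the query string; check the _analyze documentation for your version before relying on that.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:9200"  # assumption: a local node
INDEX = "test_elision"              # hypothetical index name

# Assumed analyzer definition for illustration; substitute your own settings.
SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "french_elision": {
                    "type": "elision",
                    "articles": ["l", "d", "j", "m", "t", "s", "n", "c", "qu"],
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["french_elision", "lowercase"],
                }
            },
        }
    }
}


def build_requests(text: str) -> List[Request]:
    """Build (but do not send) the delete, create and _analyze requests."""
    delete = Request(f"{BASE_URL}/{INDEX}", method="DELETE")
    create = Request(
        f"{BASE_URL}/{INDEX}",
        data=json.dumps(SETTINGS).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    query = urlencode({"analyzer": "my_analyzer"})
    analyze = Request(
        f"{BASE_URL}/{INDEX}/_analyze?{query}",
        data=text.encode("utf-8"),  # assumption: raw text body (1.x style)
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    return [delete, create, analyze]


def send(request: Request) -> str:
    """Send one request to the node and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints what would be sent; call send() on each request to run it.
    for req in build_requests("l'attention"):
        print(req.get_method(), req.full_url)
```

Building the requests separately from sending them makes it easy to print exactly what goes to the node before running anything against a real index.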

Then, if it's still unclear, create a simple gist with that.
If you are using Sense, there's no need to copy and paste curl commands; just share the Sense script.

rodrigue · February 11, 2016, 6:52pm

Before asking for help, that's exactly what I did. I tried a lot of different combinations, different analyzers, etc. When I run the _analyze API with l'attention, it returns the token attent, which is exactly what I was expecting.
But for some reason it doesn't work on attachments. How can I run _analyze on an attachment?

rodrigue · February 11, 2016, 7:14pm

Oh, yeah. One more detail: when I index a Word attachment (.doc), it works perfectly (even with my full analyzer, including synonyms), but when I index a PDF with the same content, it doesn't work. Strange, no?

dadoonet · February 11, 2016, 7:20pm

Indeed... Weird. I'll try to look at it tomorrow.

rodrigue · February 12, 2016, 9:43am

Sorry, I think I was wrong about the difference between Word and PDF files. I tried to reproduce it this morning and couldn't: I got the same result with both.
The difference is that when the text is copied and pasted from the PDF into Word, quotes are changed into apostrophes, some spaces are added, etc. Word modifies the text, which explains why it worked with Word and not with PDFs.
But when the content is strictly the same, I get the same result with both files.

dadoonet · February 12, 2016, 10:48am

Can you change your mapping from:

"pdfFile": {
   "type": "attachment", 
   "analyzer": "my_analyzer"
}

To:

"pdfFile": {
   "type": "attachment", 
   "fields" : {
      "content" : {"type" : "string", "analyzer": "my_analyzer"}
   }
}

And see what happens then?
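
For anyone building the index body from a script, the suggested mapping could be generated with something like the Python sketch below. The idea, as I understand it, is that the analyzer is declared on the text sub-field rather than on the attachment field itself. The helper names and the `document` type name are hypothetical, and the sub-field name is a parameter because it may need to differ depending on the plugin version.

```python
import json


def attachment_mapping(field: str, analyzer: str, subfield: str = "content") -> dict:
    """Build a properties fragment for an attachment field whose extracted
    text sub-field uses the given analyzer."""
    return {
        field: {
            "type": "attachment",
            "fields": {
                # The analyzer goes on the sub-field, not on the attachment itself.
                subfield: {"type": "string", "analyzer": analyzer},
            },
        }
    }


def index_body(doc_type: str, properties: dict) -> dict:
    """Wrap a properties fragment in a mappings section for one document type."""
    return {"mappings": {doc_type: {"properties": properties}}}


if __name__ == "__main__":
    # "document" is a hypothetical type name used only for this example.
    props = attachment_mapping("pdfFile", "my_analyzer")
    print(json.dumps(index_body("document", props), indent=2))
```

The analysis settings that define my_analyzer still have to be part of the index creation request; this sketch only covers the mappings section.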

rodrigue · February 12, 2016, 3:44pm

Yes! You did it... It works. Thank you.
I just had to make one small adaptation to your code. Instead of content, I had to use the same name as the attachment field (pdfFile):

"pdfFile": {
   "type": "attachment",
   "fields" : {
      "pdfFile" : {"type" : "string", "analyzer": "my_analyzer"}
   }
}

Thanks a lot for your help. I would never have thought of that myself. Everything works fine now.

dadoonet · February 12, 2016, 4:10pm

Which elasticsearch version are you using?

And by the way, I think it would be good to add this to the documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html.

Contributions are warmly welcomed!

rodrigue · February 12, 2016, 5:11pm

I'm using Elasticsearch version 1.7.4.
I'll try to update the Elasticsearch documentation, and hopefully I'll be able to explain this properly.
Do you think I should explain the problem in the documentation, or should I propose mapping the content field as a best practice?
Your documentation is very well done and I don't want to make a mistake. Will my modification go online directly, or will it be checked by a webmaster first?

dadoonet · February 12, 2016, 6:35pm

We will review it. No worries.
