Hello, I have a small problem. I'm indexing French-language PDF documents
using the excellent "mapper-attachments" plugin. We are running
Elasticsearch 1.7.4 and have installed the plugin version that should
match it (at least I hope so).
The plugin was installed like this:
sudo /usr/share/elasticsearch/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.0
Everything works fine except for words containing a quote character. In
French, an article is elided and joined to the following word with a quote;
for example, "the attention" is translated as "l'attention". When I index
an attachment containing "l'attention" and then search for "attention",
it doesn't match.
When I index a regular string using the same analyzer, it works fine:
"l'attention" matches "attention". The French elision filter works
on regular strings but not on attachments.
I hope someone will be able to help me.
If needed, I can provide a test case that shows the problem exactly.
OK. I wrote a simple example to show my problem.
For it to work properly, the script and the PDF file have to be in the same directory (it works in any bash, even Git Bash on Windows).
Since I could not upload a PDF or a shell script in the reply, I temporarily pushed them to my website.
You can find the example at this address... (If there is a better way to do it, tell me and I will do it)
That's exactly what I did before asking for help. I tried a lot of different combinations, different analyzers, etc. When I run the _analyze API on l'attention, it returns the token attent, which is exactly what I expected.
But for some reason it doesn't work on attachments. How could I run _analyze on an attachment?
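For anyone following along, the check described above can be scripted. This is only a minimal sketch: it assumes the Elasticsearch 1.x form of _analyze, which takes `analyzer` and `text` as query-string parameters, and the host and the index name `my_index` are hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_analyze_url(host, index, analyzer, text):
    """Return the URL of an _analyze request for the given analyzer and text."""
    # urlencode percent-escapes the quote so it survives the query string.
    query = urlencode({"analyzer": analyzer, "text": text})
    return "{}/{}/_analyze?{}".format(host, index, query)


if __name__ == "__main__":
    # Host and index name are hypothetical placeholders.
    print(build_analyze_url("http://localhost:9200", "my_index",
                            "my_analyzer", "l'attention"))
```

Note that this only analyzes the string passed in. It says nothing about which analyzer is applied to the text extracted from an attachment, so it can return the expected token while searches on the attachment still fail.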
Oh, yes, one more detail. When I index a Word attachment (.doc), it works perfectly (even with my full analyzer, including synonyms), but when I index a PDF with the same content, it doesn't work. Strange, no?
Sorry, I think I was wrong about the difference between Word and PDF files. I tried to reproduce it this morning and couldn't: I got the same result with both.
The difference is that when the text is copied and pasted from the PDF into Word, quotes are changed into apostrophes, some spaces are added, and so on. Word modifies the text, which explains why it worked with the Word file and not with the PDF.
But when the content is strictly identical, I get the same result with both files.
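This kind of substitution is easy to miss by eye. Here is a small Python sketch, not part of the original test case, showing that the straight quote (U+0027) and the typographic apostrophe (U+2019) are different characters. It assumes U+2019 is what Word substitutes. The `strip_elision` function and the `ARTICLES` set are hypothetical and deliberately naive; they only illustrate the idea and are not how the Elasticsearch elision filter is implemented.

```python
import unicodedata

STRAIGHT = "l'attention"      # U+0027, the straight quote
CURLY = "l\u2019attention"    # U+2019, assumed to be what Word substitutes

# Hypothetical list of elided French articles, for illustration only.
ARTICLES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}


def describe_marks(word):
    """Return the Unicode names of the non-letter characters in word."""
    return [unicodedata.name(ch) for ch in word if not ch.isalpha()]


def strip_elision(word, marks=("'", "\u2019")):
    """Drop a leading elided article if it is followed by one of marks."""
    for mark in marks:
        head, sep, tail = word.partition(mark)
        if sep and head.lower() in ARTICLES:
            return tail
    return word


if __name__ == "__main__":
    print(describe_marks(STRAIGHT), describe_marks(CURLY))
    # Handling only the straight quote leaves the curly variant untouched.
    print(strip_elision(STRAIGHT, marks=("'",)), strip_elision(CURLY, marks=("'",)))
```

Printing the character names of a token that refuses to match is a quick way to confirm whether two files really contain the same text.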
Yes! You did it... It works. Thank you.
I just had to make one small adaptation to your code. Instead of content, the sub-field has to use the same name as the attachment field (pdfFile here):
"pdfFile": {
  "type": "attachment",
  "fields": {
    "pdfFile": {"type": "string", "analyzer": "my_analyzer"}
  }
}
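To put the fragment in context, here is a sketch of what a full index-creation body might look like, built as a Python dict so it can be checked and serialized. Only the `pdfFile` mapping and the analyzer name `my_analyzer` come from this thread. The type name `document`, the filter names, the article list and the analyzer definition itself are assumptions, since the actual analyzer is not shown here.

```python
import json

FIELD = "pdfFile"  # attachment field name used in the mapping above


def build_index_body(field=FIELD, analyzer="my_analyzer"):
    """Return an index-creation body with an analyzed attachment sub-field."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Assumed filter definitions; the real ones are not shown.
                    "french_elision": {
                        "type": "elision",
                        "articles": ["l", "d", "j", "m", "t", "s", "n", "c", "qu"],
                    },
                    "french_stemmer": {"type": "stemmer", "language": "french"},
                },
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["french_elision", "lowercase", "french_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "document": {  # hypothetical type name
                "properties": {
                    field: {
                        "type": "attachment",
                        # The sub-field carries the same name as the
                        # attachment field, not "content".
                        "fields": {
                            field: {"type": "string", "analyzer": analyzer}
                        },
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

The printed JSON could then be sent as the body of the index-creation request; adjust the names to match your own setup.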
Thanks a lot for your help. I would never have thought of that myself. Everything works fine now.
I'm using Elasticsearch version 1.7.4.
I'll try to update the Elasticsearch documentation, and hopefully I'll be able to explain this properly.
Do you think I should explain the problem in the documentation, or should I propose mapping the content field as a best practice?
Your documentation is very well done and I don't want to make a mistake. Will my change go online directly, or will it be reviewed by someone first?