I'm currently evaluating Elasticsearch for one of my small private software projects.
I'm trying to index attachments with the mapper-attachments plugin and the indexing of .txt and .pdf files works fine.
But I think there is an error in the indexing of .docx files. When searching for my test string, I get the .pdf and .txt results, but not the .docx results. I have already searched for this problem but can't find a solution.
We need to index documents from a system that can store any doc type. If ES doesn't index a given document we'll need to handle that scenario by reading in the contents ourselves and indexing them separately...
That's what I'd do for all docs instead of sending binary content to Elasticsearch.
To be clear: don't use the mapper-attachments plugin if you can do the text extraction another way. You can actually use Tika yourself.
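To make the "extract the text yourself" idea concrete, here is a minimal sketch for the .docx case only. It is not Tika and not what the plugin does internally; it relies on the fact that a .docx file is a zip archive whose main body lives in `word/document.xml`, and it uses only the Python standard library. The function names and the `filename` / `content` field names are made up for illustration, and headers, footers, footnotes and similar parts are ignored.

```python
import json
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path_or_file):
    """Return the body text of a .docx file, one line per non-empty paragraph.

    Accepts a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of one paragraph is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def build_index_document(filename, text):
    """Build a JSON body to index as plain text.

    The field names are hypothetical; use whatever your mapping defines.
    """
    return json.dumps({"filename": filename, "content": text})
```

The JSON string returned by `build_index_document` can then be indexed like any other text document, so Elasticsearch never sees the binary file. For a system that stores arbitrary document types, a general extractor such as Tika is the more realistic choice; the sketch only shows the shape of the approach.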
Thanks David, we'll look into this approach - makes sense to me.
Do you know where I can get such a list? Anyone I can ask?
I'd have thought there should be a list somewhere of what doc types the attachment plugin supports...