I'm indexing a website which has a lot of files on it.
I found the attachment plugin which handles all file types we have, but our
files are not "attached" (associated) with a particular web page -- in many
cases the same file is attached to multiple pages. So we want files to show
in the search results alongside other items.
I can extract data from the file myself using Apache Tika and index it like
any other document in the system; but given that Tika already runs inside the
attachment plugin, is there any way to use the built-in system?
If you can do it yourself and use Tika directly, I’d definitely do that and not use the mapper attachment plugin.
You will have more control over exactly what you want to do than with the mapper attachment plugin.
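To make that concrete, here is a rough sketch of what indexing files as standalone documents could look like. It is only an illustration under some assumptions: `build_file_document`, the field names, and the `extract_text` callable are all hypothetical, and `extract_text` stands in for whatever Tika call you end up using. The one idea it shows is deriving the document ID from the file's bytes, so a file linked from several pages becomes a single search hit that lists every page referencing it.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, Tuple


def build_file_document(
    content: bytes,
    filename: str,
    page_urls: Iterable[str],
    extract_text: Callable[[bytes], str],
) -> Tuple[str, Dict[str, object]]:
    """Return (doc_id, doc) for one file, to be indexed like any other document.

    extract_text is a placeholder for your own Tika-based extraction;
    the field names below are illustrative, not a required schema.
    """
    # Use a hash of the raw bytes as the ID: the same file reached from
    # several pages maps to the same ID, so it is indexed only once.
    doc_id = hashlib.sha256(content).hexdigest()
    doc = {
        "type": "file",
        "filename": filename,
        # Record every page that links to the file, without duplicates.
        "pages": sorted(set(page_urls)),
        "content": extract_text(content).strip(),
    }
    return doc_id, doc


if __name__ == "__main__":
    # Stand-in extractor for the demo; a real one would hand the bytes to Tika.
    def fake_extract(data: bytes) -> str:
        return data.decode("utf-8")

    doc_id, doc = build_file_document(
        b"Annual report text",
        "report.txt",
        ["/about", "/investors", "/about"],
        fake_extract,
    )
    print(doc_id)
    print(json.dumps(doc, indent=2))
```

The resulting dictionary can be sent to the index with whatever client you already use for your other documents. Keeping extraction behind a plain function is what gives you the extra control: you can truncate text, skip file types, or add metadata before anything is indexed.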