Hi,
I was able to upload various kinds of files to ES using FSCrawler. However, when I upload a BAM file (a large binary file generated in WGS pipelines), only the name of the file gets stored in the content field.
I would like to know if there is a way ES can read the contents of the file and index them so that we can search based on the content.
Are there any limitations (or specific file types for which content is not extracted)? Any pointers or help with this would be much appreciated.
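For reference, this is how I am checking what was actually stored; the index name `my_docs` and the file name `sample.bam` below are placeholders for my actual job, and the field names follow FSCrawler's default mapping:

```
curl -s "localhost:9200/my_docs/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "file.filename": "sample.bam" } },
  "_source": ["file.filename", "content"]
}'
```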
I could not find the .bam format listed specifically, but since it is a binary file I assumed it would work. I have two questions:
1. There is no error generated or warning shown that the file format is not supported.
2. I am able to see the metadata of the file, so why is the file name stored in the content field of the document?
I would appreciate it if you could clarify these two points. I am working with my team to get more details and see if there are alternative options we can consider (such as a different file format or a different file altogether).
FSCrawler tries its best to extract as much information as possible. Some metadata is generated by FSCrawler itself.
The rest is extracted by Apache Tika.
If Tika cannot extract the content, it is simply ignored.
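If you want to check what Tika does with a BAM file on its own, independently of FSCrawler, you can run the tika-app jar directly; the jar version and file path below are just examples:

```
# Ask Tika what MIME type it detects for the file
java -jar tika-app-2.9.1.jar --detect /path/to/sample.bam
# Attempt plain-text extraction; an empty result typically means Tika
# has no parser that can pull text out of this format
java -jar tika-app-2.9.1.jar --text /path/to/sample.bam
```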