Hi @dadoonet,
I recently started using FS crawler. We need to crawl all files on a local file system, most of which are PDFs, along with other document formats. The PDFs were not all generated with the same tool. In particular, we have more than 15K files generated with Acrobat Capture 3.0, and the content crawled and indexed into ES from those files is malformed. FS crawler works like a charm with other PDFs and document formats, for example those generated with Acrobat PDFMaker 9.1 for Word or with MS Word.
We tried to troubleshoot this behavior but could not find a workaround for those PDFs.
The example output looks like this:
S a m p l e t e x t is not a r e a l l y w o r k i n g .
Some words are correctly formed and some are not.
We also copied the entire text of a PDF and pasted it into a text editor such as Notepad++, and got the same output as FS crawler. We also tried other PDF readers such as pdfreader from PyPI. The words are formed correctly there, but the bullets are separated from their text.
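To make the symptom easier to reproduce, here is a minimal sketch of one way to flag affected documents once their text has been extracted. It is not part of FS crawler; the function names and the 0.5 threshold are assumptions chosen for illustration, and the heuristic simply measures how many whitespace-separated tokens are single letters.

```python
def single_letter_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for tok in tokens if len(tok) == 1 and tok.isalpha())
    return singles / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.5) -> bool:
    """Heuristic check: True when at least `threshold` of the tokens are single letters.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    return single_letter_ratio(text) >= threshold


if __name__ == "__main__":
    broken = "S a m p l e t e x t is not a r e a l l y w o r k i n g ."
    clean = "Sample text is not really working."
    print(looks_letter_spaced(broken), looks_letter_spaced(clean))
```

A check like this only identifies which files show the problem; it does not repair the text.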
The sample output from pdfreader looks like this:
1.
2.
3.
Dummy text
Lorem Ipsum
Hello World
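For illustration, a minimal sketch of how output in this shape could be re-paired after extraction. It assumes the markers always arrive as a block of marker-only lines followed by the same number of text lines, exactly as in the sample above; real documents may not follow that pattern. The function name and the marker regex are hypothetical and not part of pdfreader or FS crawler.

```python
import re

# Assumption: a list marker is a line holding only digits followed by "." or ")".
MARKER = re.compile(r"^\d+[.)]$")


def rejoin_list_markers(lines):
    """Pair each run of marker-only lines with the equally long run of text lines after it."""
    result = []
    i, n = 0, len(lines)
    while i < n:
        # Find the end of a run of consecutive marker-only lines starting at i.
        j = i
        while j < n and MARKER.match(lines[j].strip()):
            j += 1
        run = j - i
        following = lines[j:j + run]
        if run and len(following) == run and not any(
            MARKER.match(line.strip()) for line in following
        ):
            # Enough plain text lines follow: join marker and text pairwise.
            for marker, text in zip(lines[i:j], following):
                result.append(f"{marker.strip()} {text.strip()}")
            i = j + run
        elif run:
            # Pattern does not match the assumption: keep the markers unchanged.
            result.extend(lines[i:j])
            i = j
        else:
            result.append(lines[i])
            i += 1
    return result


if __name__ == "__main__":
    sample = ["1.", "2.", "3.", "Dummy text", "Lorem Ipsum", "Hello World"]
    print("\n".join(rejoin_list_markers(sample)))
```

This is only a post-processing stopgap and says nothing about why the extraction order is wrong in the first place.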
Please let me know how we can resolve these parsing issues.