PDF Parsing Issues when using FSCrawler

Hi @dadoonet,

I have started using FSCrawler recently. We have a requirement to crawl all files on the local file system, most of which are PDFs along with other document formats. The catch with these PDFs is that they were not all generated by the same tool. In particular, we have around 15K+ files generated with the Acrobat Capture 3.0 tool, and all of the content crawled and indexed into Elasticsearch from those files is malformed. FSCrawler works like a charm with the other PDFs and document formats, which were generated with Acrobat PDFMaker 9.1 for Word or MS Word itself.

We tried to troubleshoot this behavior and could not find any workaround for those PDFs.

The example output looks like this:

S a m p l e t e x t is not a r e a l l y w o r k i n g .

Some words are formed correctly and some are not.
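
In case it helps triage, a rough heuristic for flagging affected files is to measure the share of single-character tokens in the extracted text. A minimal sketch (the 0.4 threshold is an arbitrary assumption, not a tuned value):

```python
def looks_letter_spaced(text: str, threshold: float = 0.4) -> bool:
    """Flag text where an unusually high share of whitespace-separated
    tokens are single characters -- the symptom shown above."""
    tokens = text.split()
    if not tokens:
        return False
    singles = sum(1 for t in tokens if len(t) == 1)
    return singles / len(tokens) > threshold

print(looks_letter_spaced("S a m p l e t e x t is not a r e a l l y w o r k i n g ."))  # True
print(looks_letter_spaced("Some words are formed correctly and some are not."))  # False
```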

We have also copied the entire text of one of these PDFs and pasted it into text editors like Notepad++, and saw the same output as FSCrawler produces. We have also tried other PDF readers such as pdfreader from PyPI; there the words come out well formed, but bulleted lists break apart (see the sketch after the sample output below).

The sample output looks like this:

1.
2.
3.
Dummy text
Lorem Ipsum
Hello World
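
For reference, the pdfreader extraction was along these lines. A minimal sketch based on pdfreader's documented SimplePDFViewer usage; the file path is a placeholder:

```python
from pdfreader import SimplePDFViewer

# "sample.pdf" is a placeholder for one of the affected PDFs.
with open("sample.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)
    for page_number, canvas in enumerate(viewer, start=1):
        # canvas.strings holds the decoded text runs for the page,
        # in the order they appear in the content stream -- which is
        # why list bullets can end up separated from their items.
        print(f"--- page {page_number} ---")
        print("".join(canvas.strings))
```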

Please let me know how we can resolve these parsing issues.

Welcome!

It would help if you could share a sample file, so I can test it and see if there are any options in Tika that might help.
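
For example, Tika's PDF parser (PDFBox under the hood) has a few spacing-related settings such as enableAutoSpace, spacingTolerance, and averageCharTolerance. With the tika-python client against a Tika server, these can be passed as per-request headers. A rough sketch, assuming the header names follow Tika Server's X-Tika-PDF* convention; the values shown are only starting points to experiment with, not recommendations:

```python
from tika import parser  # pip install tika; starts/uses a Tika server

# Header names map to Tika's PDFParserConfig properties.
parsed = parser.from_file(
    "sample.pdf",  # placeholder path
    headers={
        "X-Tika-PDFenableAutoSpace": "true",
        "X-Tika-PDFspacingTolerance": "0.5",
        "X-Tika-PDFaverageCharTolerance": "0.3",
    },
)
print(parsed["content"])
```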

Thanks for the swift reply. I am waiting for a sample PDF that can be shared here. I really appreciate your patience.

Ideally, share it in a new issue on GitHub (dadoonet/fscrawler, the Elasticsearch File System Crawler) so we can track this. :blush:


I will surely follow the suggestion. :grinning: