PDF Parsing Issues when using FSCrawler

Hi @dadoonet,

I have started using FSCrawler recently. We have a requirement to crawl all files on the local file system, most of which are PDFs along with other document formats. The catch with these PDFs is that they were not all generated by the same tool. In particular, we have around 15K+ files generated with the Acrobat Capture 3.0 tool, and all of the content crawled and indexed into Elasticsearch from those files is malformed. FSCrawler works like a charm with the other PDFs and document formats, which were generated with Acrobat PDFMaker 9.1 for Word or MS Word itself.

We tried to troubleshoot this behavior and could not find any workaround for those PDFs.

The example output looks like this:

S a m p l e t e x t is not a r e a l l y w o r k i n g .

Some words are formed correctly and some are not.
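
In case it helps triage, a rough heuristic for flagging affected files is to measure the share of single-character tokens in the extracted text. A minimal sketch (the 0.4 threshold is an arbitrary assumption, not a tuned value):

```python
def looks_letter_spaced(text: str, threshold: float = 0.4) -> bool:
    """Flag text where an unusually high share of whitespace-separated
    tokens are single characters -- the symptom shown above."""
    tokens = text.split()
    if not tokens:
        return False
    singles = sum(1 for t in tokens if len(t) == 1)
    return singles / len(tokens) > threshold

print(looks_letter_spaced("S a m p l e t e x t is not a r e a l l y w o r k i n g ."))  # True
print(looks_letter_spaced("Some words are formed correctly and some are not."))  # False
```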

We have also copied the entire text of one of these PDFs and pasted it into text editors like Notepad++, and saw the same output as FSCrawler produces. We have also tried other PDF readers such as pdfreader from PyPI; there the words come out well formed, but bulleted lists break apart (see the sketch after the sample output below).

The sample output looks like this:

1.
2.
3.
Dummy text
Lorem Ipsum
Hello World
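
For reference, the pdfreader extraction was along these lines. A minimal sketch based on pdfreader's documented SimplePDFViewer usage; the file path is a placeholder:

```python
from pdfreader import SimplePDFViewer

# "sample.pdf" is a placeholder for one of the affected PDFs.
with open("sample.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)
    for page_number, canvas in enumerate(viewer, start=1):
        # canvas.strings holds the decoded text runs for the page,
        # in the order they appear in the content stream -- which is
        # why list bullets can end up separated from their items.
        print(f"--- page {page_number} ---")
        print("".join(canvas.strings))
```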

Please let me know how we can resolve these parsing issues.

Welcome!

It would help if you could share a sample file, so I can test it and see if there are any options in Tika that might help.
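
For example, Tika's PDF parser (PDFBox under the hood) has a few spacing-related settings such as enableAutoSpace, spacingTolerance, and averageCharTolerance. With the tika-python client against a Tika server, these can be passed as per-request headers. A rough sketch, assuming the header names follow Tika Server's X-Tika-PDF* convention; the values shown are only starting points to experiment with, not recommendations:

```python
from tika import parser  # pip install tika; starts/uses a Tika server

# Header names map to Tika's PDFParserConfig properties.
parsed = parser.from_file(
    "sample.pdf",  # placeholder path
    headers={
        "X-Tika-PDFenableAutoSpace": "true",
        "X-Tika-PDFspacingTolerance": "0.5",
        "X-Tika-PDFaverageCharTolerance": "0.3",
    },
)
print(parsed["content"])
```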

Thanks for the swift reply. I am waiting for a sample PDF that can be shared here. I really appreciate your patience.

Ideally, share it in a new issue on GitHub (dadoonet/fscrawler, the Elasticsearch File System Crawler) so we can track this. :blush:


I will surely follow the suggestion. :grinning: