I have FSCrawler 2.2 working very well with PDFs.
However, some documents were originally scanned at poor quality.
Of course some of them will be impossible to interpret, but I need a way to identify those bad, "non-indexed" documents so I can process them separately.
Any idea how to accomplish this?
Below is a sample of such a document. You can see the content field contains only blanks; not a single letter was parsed.
Are there return codes from the PDF parsing that could be added, for example as a tag? That way I could easily identify those documents.
{
  "_index": "cndoc",
  "_type": "doc",
  "_id": "5fc69236b36d14962f737a23a54fa3f",
  "_score": 1,
  "_source": {
    "content": """
""",
    "meta": {},
    "file": {
      "extension": "pdf",
      "content_type": "application/pdf",
      "last_modified": "2017-03-09T09:11:24",
      "indexing_date": "2017-06-29T15:59:01.49",
      "filesize": 274806,
      "filename": "Liste des presents .pdf",
      "url": """file://\tmp\es\Liste des presents .pdf""",
      "indexed_chars": 10000
    },
    "path": {
      "encoded": "824b64ab42d4b63cda6e747e2b80e5",
      "root": "824b64ab42d4b63cda6e747e2b80e5",
      "virtual": "/",
      "real": """\tmp\es\Liste des presents .pdf"""
    }
  }
},
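In the meantime, one client-side workaround (not an FSCrawler feature) would be to scan the index afterwards and flag every hit whose content field is empty or whitespace-only. Here is a minimal sketch using the elasticsearch-py client; the index name "cndoc" and the field layout are taken from the sample above, while the endpoint and the blank-content heuristic are assumptions:

```python
def is_blank(content):
    # Treat None, "", or whitespace-only content as "nothing was parsed".
    return content is None or content.strip() == ""

def find_unparsed(es, index="cndoc"):
    """Yield filenames of documents whose extracted content is blank.

    Requires the elasticsearch-py package; `es` is an Elasticsearch client.
    """
    from elasticsearch.helpers import scan  # third-party dependency
    for hit in scan(es, index=index, _source=["content", "file.filename"]):
        src = hit["_source"]
        if is_blank(src.get("content")):
            yield src["file"]["filename"]

# Example (endpoint is a placeholder):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   for name in find_unparsed(es):
#       print(name)  # e.g. "Liste des presents .pdf"
```

The list of filenames could then be fed to an OCR pass (e.g. Tesseract) and re-indexed separately.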