Return code for less successful PDF OCR in fscrawler?

I have fscrawler 2.2 working very well with PDFs.
However, some documents were originally scanned and are of poor quality.
Of course some of them will be impossible to interpret, but I need a way to identify those badly parsed, effectively non-indexed documents so that I can process them separately.

Any idea how to accomplish this?

Below is a sample of such a document. The content field contains only blank lines; not a single letter has been parsed.
Does the PDF parsing produce return codes that could be added to the document, for example as a tag? That way I could easily identify those documents.

{
  "_index": "cndoc",
  "_type": "doc",
  "_id": "5fc69236b36d14962f737a23a54fa3f",
  "_score": 1,
  "_source": {
    "content": """





    """,
    "meta": {},
    "file": {
      "extension": "pdf",
      "content_type": "application/pdf",
      "last_modified": "2017-03-09T09:11:24",
      "indexing_date": "2017-06-29T15:59:01.49",
      "filesize": 274806,
      "filename": "Liste des presents .pdf",
      "url": """file://\tmp\es\Liste des presents .pdf""",
      "indexed_chars": 10000
    },
    "path": {
      "encoded": "824b64ab42d4b63cda6e747e2b80e5",
      "root": "824b64ab42d4b63cda6e747e2b80e5",
      "virtual": "/",
      "real": """\tmp\es\Liste des presents .pdf"""
    }
  }
},

Interesting. Maybe I could add an ignore_empty option?

That said, with 2.3-SNAPSHOT you can:

- Run OCR, if you also install Tesseract
- Call an ingest pipeline, in which I think you can run a Painless script that detects empty content
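
To make the ingest pipeline idea more concrete, here is a minimal sketch of what such a pipeline body might look like, built as a Python dict so it can be printed and sent to Elasticsearch with any HTTP client. The flag field name content_is_blank is hypothetical, and the script has not been tested against a real cluster; it assumes the extracted text lives in the content field, as in the sample document above, and that the script processor accepts the "inline" key (believed to be the Elasticsearch 5.x name; later versions use "source").

```python
import json


def build_blank_content_pipeline(flag_field="content_is_blank"):
    """Return an ingest pipeline body that flags documents with blank content.

    flag_field is a hypothetical field name chosen for this sketch.
    """
    # Painless: set the flag to true when content is missing or whitespace-only.
    script = (
        "ctx." + flag_field + " = "
        "ctx.content == null || ctx.content.trim().isEmpty();"
    )
    return {
        "description": "Flag documents whose extracted content is blank",
        "processors": [
            {
                "script": {
                    "lang": "painless",
                    # Assumption: 5.x calls this key "inline"; 6.x and later use "source".
                    "inline": script,
                }
            }
        ],
    }


if __name__ == "__main__":
    # Print the body that would be sent with PUT _ingest/pipeline/<some-name>.
    print(json.dumps(build_blank_content_pipeline(), indent=2))
```

With a flag like this in place, the badly scanned documents could be found with a plain term query on the flag field instead of inspecting the content.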

If you try any of this, I'll be happy to hear about it :)

Hello,
I have been trying to filter out or identify those "empty" documents in Elasticsearch, with almost complete success. As you can see, the content is not really empty: it contains at least a number of newline characters.

I did something ugly like the query below. I do get hits on the desired documents, but for some reason also on some other documents (some indexed Word documents) that have normal content. I believe Elasticsearch does not support full regex, and I'm not a regex magician.
I have 430 documents indexed.
The "must_not" query returns 38 documents.
If I run it with "must", it returns 392, so everything adds up.
Any idea why some normally indexed documents are matched by this query?

GET /cndoc/doc/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "content": "[a-z|0-9]"
          }
        }
      ]
    }
  }
}
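
A likely explanation for the extra hits: a match query does not interpret its text as a regular expression. The text is analyzed like any other search text, so with the default standard analyzer "[a-z|0-9]" probably ends up as the separate terms a, z, 0 and 9, and the must_not clause then only excludes documents that contain one of those as a standalone word. The sketch below illustrates this with a very rough stand-in for the analyzer (a simple regex split, not the real tokenizer), adds a client-side check for whitespace-only content, and shows an untested alternative query based on regexp, which works on indexed terms. All function and variable names here are made up for the illustration.

```python
import re


def rough_standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def is_blank_content(source):
    """True when a hit's _source has no content or only whitespace."""
    content = source.get("content")
    return content is None or content.strip() == ""


def blank_hits(hits):
    """Keep only the search hits whose content is blank."""
    return [hit for hit in hits if is_blank_content(hit.get("_source", {}))]


# Untested assumption: regexp matches against indexed terms, so ".+" matches any
# document with at least one term in content; must_not keeps those with none.
NO_TERMS_QUERY = {
    "query": {
        "bool": {
            "must_not": [
                {"regexp": {"content": ".+"}}
            ]
        }
    }
}


if __name__ == "__main__":
    # Shows which terms the original query text roughly breaks down into.
    print(rough_standard_tokens("[a-z|0-9]"))
```

The client-side check needs the hits to be fetched first (for example with a match_all search), which is workable for a few hundred documents; the regexp variant keeps the filtering inside Elasticsearch but should be verified against the 38 expected documents before relying on it.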
