Return code for less successful PDF OCR in fscrawler?

ponten · July 3, 2017, 8:05am

I have fscrawler 2.2 working very well with PDFs.
However, some documents were originally scanned and are of poorer quality.
Of course some of them will be impossible to interpret, but I need a way to identify those bad "non-indexed" documents so I can process them separately.

Any idea of how to accomplish it?

Below is a sample of such a document. You can see a lot of blanks in the content field; not a single letter has been parsed.
Are there return codes from the PDF parsing that could be added, for example as a tag? That way I could easily identify those documents.

{
        "_index": "cndoc",
        "_type": "doc",
        "_id": "5fc69236b36d14962f737a23a54fa3f",
        "_score": 1,
        "_source": {
          "content": """





    """,
          "meta": {},
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "last_modified": "2017-03-09T09:11:24",
            "indexing_date": "2017-06-29T15:59:01.49",
            "filesize": 274806,
            "filename": "Liste des presents .pdf",
            "url": """file://\tmp\es\Liste des presents .pdf""",
            "indexed_chars": 10000
          },
          "path": {
            "encoded": "824b64ab42d4b63cda6e747e2b80e5",
            "root": "824b64ab42d4b63cda6e747e2b80e5",
            "virtual": "/",
            "real": """\tmp\es\Liste des presents .pdf"""
          }
        }
}

dadoonet · July 3, 2017, 9:20am

Interesting. Maybe I can add an ignore_empty option?

That said with 2.3-SNAPSHOT you can:

Have OCR, if you also install Tesseract
Call an ingest pipeline where, I think, you can run a Painless script that detects empty content

If you do anything, I'd be happy to hear about it.
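
To make the ingest pipeline idea a bit more concrete, here is a minimal sketch in Python that only builds the pipeline definition. The flag field name `empty_content` is made up for this example, and the script assumes the extracted text lives in `content`, as in the sample document above. Recent Elasticsearch versions take the script text under `source`; some 5.x releases expect `inline` instead, so check the docs for your version. Untested against a real cluster.

```python
import json

# Hypothetical field used to flag documents; not part of fscrawler's mapping.
FLAG_FIELD = "empty_content"


def build_pipeline(flag_field=FLAG_FIELD):
    """Return an ingest pipeline body that flags missing or whitespace-only content."""
    # Painless: trim() strips leading/trailing whitespace (newlines included),
    # so a content field holding only blank lines ends up empty.
    script = (
        "ctx['%s'] = (ctx.content == null || ctx.content.trim().isEmpty());"
        % flag_field
    )
    return {
        "description": "Flag documents whose content is missing or whitespace only",
        "processors": [
            # Assumption: 'source' is the script key; older 5.x may need 'inline'.
            {"script": {"lang": "painless", "source": script}}
        ],
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the pipeline.
    print(json.dumps(build_pipeline(), indent=2))
```

The printed JSON would be the body of a pipeline creation request. Once documents go through it, the badly scanned ones could be found with a simple term query on the flag field instead of a query on `content`.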

ponten · July 10, 2017, 3:41pm

Hello,
I have been trying to filter out or identify those "empty" documents in Elasticsearch, with almost complete success. As you can see, the content is not really empty; it contains at least a number of newline characters.

I did something ugly like the query below. I do get hits on the desired documents, but for some reason also on some other docs (some indexed Word documents) that have normal content. I believe Elasticsearch does not support full regex, and I'm not a regex magician.
I have 430 documents indexed.
The "must_not" query returns 38 documents.
If I run it with "must", it returns 392, so everything adds up.
Any idea why some normally indexed documents get matched by this query?

GET /cndoc/doc/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "content": "[a-z|0-9]"
          }
        }
      ]
    }
  }
}
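
A likely explanation, offered as a guess rather than a verified answer: a `match` query does not treat its text as a regular expression. The text is analyzed first, so with the default standard analyzer `[a-z|0-9]` probably ends up as the separate terms `a`, `z`, `0` and `9`. The `must_not` clause would then return every document that contains none of those four terms as standalone words, which can include ordinary documents. As a workaround, the check can be done on the client side over search hits. The sketch below assumes hits shaped like the sample document above; the second hit and its id are invented for the demo.

```python
import re

# Any Unicode letter or digit; whitespace, punctuation and "_" do not count.
_HAS_TEXT = re.compile(r"[^\W_]")


def has_no_text(hit):
    """True if the hit's content is missing or holds no letter or digit."""
    content = hit.get("_source", {}).get("content") or ""
    return _HAS_TEXT.search(content) is None


def blank_hits(hits):
    """Return the _id of every hit whose content has no usable text."""
    return [h["_id"] for h in hits if has_no_text(h)]


if __name__ == "__main__":
    hits = [
        # Shaped like the sample from this thread: only newlines and spaces.
        {"_id": "5fc69236b36d14962f737a23a54fa3f",
         "_source": {"content": "\n\n\n\n\n    "}},
        # Invented document with normal content.
        {"_id": "hypothetical-word-doc",
         "_source": {"content": "Compte rendu de la réunion"}},
    ]
    print(blank_hits(hits))
```

This only looks at whatever hits are passed in, so for a whole index it would need to be combined with paging through the results.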

system · August 7, 2017, 3:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
