Search a PDF file using its content


(Mani Sundharram) #1

Hi,
I have installed and set up Elasticsearch and the ingest-attachment plugin.

I need to search through a list of PDF files (20000) under a given file path. How would I do that?

  1. How do you push the data to Elasticsearch? Is there a way to give the file path directly to Elasticsearch in the request itself? (I would prefer not to use a programming language such as C# or Python.)
    Note:
    I used FSCrawler to import the PDF file contents from a local file system path into Elasticsearch.
  2. How do I push the file contents into an ingest node?
  3. How does the ingest-attachment plugin work?
  4. I need to restrict the search results based on user access. How can I achieve that?

(David Pilato) #2
  1. I believe that FSCrawler does that, if I understood what you asked for.
  2. You need to serialize the binary to BASE64 and send the BASE64 within a field of your JSON document. There's a demo in the documentation.
  3. It uses Tika behind the scenes to extract the text from the document and put it into your source document.
  4. You can use the security features of the Elastic Stack. You need to activate a trial license or buy a platinum license. Or use cloud.elastic.co.
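
To illustrate point 2, here is a minimal Python sketch (standard library only) of what "serialize the binary to BASE64 and send it within a field of your JSON document" could look like. The field name `data` follows the example in the plugin documentation; the helper name `build_attachment_document` and the sample bytes are made up for illustration. The sketch only builds the request body and does not send anything.

```python
import base64
import json


def build_attachment_document(raw_bytes, field="data"):
    """Return the JSON body for an index request that goes through an
    ingest pipeline whose attachment processor reads `field`."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


# Assumption: in practice raw_bytes would come from open(path, "rb").read().
sample = b"%PDF-1.3 sample bytes, not a real PDF"
body = build_attachment_document(sample)
print(body)
```

The resulting body would then be sent with a request such as `PUT my_index/_doc/my_id?pipeline=attachment`, after creating a pipeline whose `attachment` processor has `"field": "data"`, as in the plugin documentation.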

(Mani Sundharram) #3

Can you share the documentation URL?


(David Pilato) #4

https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html


(Mani Sundharram) #5
Hi David,
   Thank you for your quick responses. I would be grateful if you could clarify the doubts below.

C:\Users\Administrator\.fscrawler\job1\_settings.json
    "elasticsearch" : {
      "index" : "jobindex1",
      "index_folder" : "jobfoldersindex1",
      "pipeline" : "fscrawler",
      "nodes" : [ {
        "url" : "http://127.0.0.1:9200"
      } ],
      "bulk_size" : 100,
      "flush_interval" : "5s",
      "byte_size" : "25mb"
    }

The above are my FSCrawler settings for Elasticsearch.

Request to create a pipeline
    PUT _ingest/pipeline/fscrawler
    {
      "description" : "fscrawler pipeline",
      "processors" : [
        {
          "set" : {
            "field": "foo",
            "value": "bar"
          }
        }
      ]
    }

Files are imported into elasticsearch successfully.

Request to get the files that contain the string below:
    GET /jobindex1/_search
    {
      "query" : {
        "match": {
          "content" : "emad"
        }
      }
    }

Result:
    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 3.15744,
        "hits" : [
          {
            "_index" : "jobindex1",
            "_type" : "_doc",
            "_id" : "7bfc1ba6cb2ea96a7cea1b84f4dbd",
            "_score" : 3.15744,
            "_source" : {
              "path" : {
                "virtual" : "/Attendance_Appraisals_07022017120507.pdf",
                "root" : "8f384e4d1aa1e6127ed1195953dccce3",
                "real" : """D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "file" : {
                "extension" : "pdf",
                "last_accessed" : "2019-01-13T06:41:24.289+0000",
                "filename" : "Attendance_Appraisals_07022017120507.pdf",
                "content_type" : "application/pdf",
                "created" : "2019-01-13T06:41:24.289+0000",
                "indexing_date" : "2019-01-13T12:04:23.798+0000",
                "filesize" : 509153,
                "last_modified" : "2017-02-07T09:05:03.872+0000",
                "url" : """file://D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "meta" : {
                "created" : "2017-02-07T02:59:52.000+0000",
                "format" : "application/pdf; version=1.3",
                "raw" : {
                  "pdf:PDFVersion" : "1.3",
                  "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
                  "xmp:CreatorTool" : "Canon ",
                  "access_permission:modify_annotations" : "true",
                  "access_permission:can_print_degraded" : "true",
                  "meta:creation-date" : "2017-02-07T06:59:52Z",
                  "created" : "2017-02-07T06:59:52Z",
                  "access_permission:extract_for_accessibility" : "true",
                  "access_permission:assemble_document" : "true",
                  "xmpTPg:NPages" : "1",
                  "Creation-Date" : "2017-02-07T06:59:52Z",
                  "resourceName" : "Attendance_Appraisals_07022017120507.pdf",
                  "dcterms:created" : "2017-02-07T06:59:52Z",
                  "dc:format" : "application/pdf; version=1.3",
                  "access_permission:extract_content" : "true",
                  "access_permission:can_print" : "true",
                  "pdf:docinfo:creator_tool" : "Canon ",
                  "access_permission:fill_in_form" : "true",
                  "pdf:encrypted" : "false",
                  "producer" : " ",
                  "access_permission:can_modify" : "true",
                  "pdf:docinfo:producer" : " ",
                  "pdf:docinfo:created" : "2017-02-07T06:59:52Z",
                  "Content-Type" : "application/pdf"
                },
                "creator_tool" : "Canon "
              },
              "foo" : "bar",
              "content" : """

    A.
    NPtr

    INTER OFFICE MEMO

    Dear All Employees

    ln reference to the above mentioned subject, irrespective of numerous correspondences, it
    has been noticed that many employees are still reporting to work late on many occasions.
    The grace period for morning Punch lN time is only 15 minutes from the official start timing
    irrespective of Head Office or Sites. Late attendance will be deducted from the monthly
    salary. Also PUNCH IN/OUT is mandatory. The Missed Punching will also be considered as
    Absent.

    Also note that the late Punching and related deduction will be affecting the Performance
    Appraisal of the employees.

    ln view of all the above all staff are requested to do proper attendance punching and if any
    technical issue please coordinate with the IT/HR department to rectify the same at the
    earliest will be given to any staff on the attendance punching

    E Janabi
    HR & Admin Manager

    Ref No. Trojan/lOM/HR & ADM/44581 17 Date: 07to2t2017 Pages 1

    To All Staff TROJAN & NPC

    From Emad AI Janabi HR & Admin Manager

    CC: Engr. Hamad Al Ameri Managing Director

    Subject Attendance Regulations & Performance Appraisals

    P.o. Box 111059, Abu Dhabi, uAE. Tel. no. +9t1 2 so973oo - Fax: +gl1 2 5g2gs94

    .i - oi

    I'l{( )f n N



    """
            }
          }
        ]
      }
    }

Why am I not able to get a result like this?
    {
      "found": true,
      "_index": "my_index",
      "_type": "_doc",
      "_id": "my_id",
      "_version": 1,
      "_source": {
        "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
        "attachment": {
          "content_type": "application/rtf",
          "language": "ro",
          "content": "Lorem ipsum dolor sit amet",
          "content_length": 28
        }
      }
    }


(David Pilato) #6

If you are using FSCrawler, you don't need the ingest attachment plugin at all, as everything is done by FSCrawler.
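
The two outputs in the previous post differ mainly in where the extracted text ends up: in the FSCrawler result it is the top-level `content` field of `_source`, while in the ingest attachment example it sits under `attachment.content`. A minimal Python sketch of reading either shape, assuming only those two layouts (the function name `extracted_text` is hypothetical):

```python
def extracted_text(source):
    """Return the extracted text from a document's _source, or None.

    Handles the two layouts shown in this thread:
    - ingest attachment: source["attachment"]["content"]
    - FSCrawler:         source["content"]
    """
    attachment = source.get("attachment")
    if isinstance(attachment, dict) and "content" in attachment:
        return attachment["content"]
    return source.get("content")


# Trimmed-down _source objects in the two shapes.
fscrawler_source = {"foo": "bar", "content": "INTER OFFICE MEMO"}
ingest_source = {"attachment": {"content": "Lorem ipsum dolor sit amet"}}

print(extracted_text(fscrawler_source))
print(extracted_text(ingest_source))
```

Either way the text is indexed and searchable; only the field name used in the query changes (`content` versus `attachment.content`).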

What is the output you need? You don't want to index the content?


(Mani Sundharram) #7

It is good to know that everything is done by FSCrawler.
I need to search inside the content of more than 20000 PDF files and return the list of matching files to the user. Is this an efficient way to do so, or would using the ingest plugin be more efficient?


(David Pilato) #8

It depends.

If you need to crawl a filesystem, then FSCrawler is good. If you just need to index a single binary file from wherever you have it, then ingest attachment is probably OK.
But it doesn't expose all Tika features, such as OCR. If you need those, FSCrawler would be preferred.
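
Whichever tool does the indexing, returning just the list of matching files is a matter of the search request. Here is a minimal Python sketch, assuming the FSCrawler document layout shown earlier in this thread (`content`, `file.filename`, `path.real`). The helper names `build_search_body` and `matching_files` are hypothetical, and the sketch builds and parses JSON only, without talking to a cluster.

```python
import json


def build_search_body(text):
    """Match query on `content` that returns only the file name and path
    instead of the whole extracted text."""
    return {
        "_source": ["file.filename", "path.real"],
        "query": {"match": {"content": text}},
    }


def matching_files(response):
    """Extract the real path of every hit from a search response."""
    return [hit["_source"]["path"]["real"] for hit in response["hits"]["hits"]]


# Trimmed-down response in the shape shown earlier in this thread.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "path": {"real": "D:\\PDf list\\Attendance_Appraisals_07022017120507.pdf"},
                    "file": {"filename": "Attendance_Appraisals_07022017120507.pdf"},
                }
            }
        ]
    }
}

print(json.dumps(build_search_body("emad")))
print(matching_files(sample_response))
```

The body would be sent to `GET /jobindex1/_search`. A search returns only the first 10 hits by default, so a `size` value or paging would be needed for longer lists.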

Disclaimer: I'm the author of FSCrawler so I might be biased. :wink:


(Mani Sundharram) #9

Thank you, David. I have successfully implemented the search.