Search a PDF file using its content

Mani_Sundharram · January 13, 2019, 7:40am

Hi,
I have installed and set up Elasticsearch and the ingest-attachment plugin.

I need to search through a list of PDF files (20000) stored under a file path. How would I do that?

How do you push the data to Elasticsearch? Is there a way to give the file path directly to Elasticsearch in the request itself? (I would prefer not to use a programming language such as C# or Python.)
Note:
1. I used FSCrawler to import the PDF file contents from a local file system path into Elasticsearch.
2. How do I push the file contents to an ingest node?
3. How does the ingest-attachment plugin work?
4. I need to restrict the search results based on user access. How can I achieve that?

dadoonet · January 13, 2019, 7:57am

1. I believe that FSCrawler does that, if I understood what you asked for.
2. You need to serialize the binary to BASE64 and send the BASE64 string within a field of your JSON document. There's a demo in the documentation.
3. It uses Tika behind the scenes to extract the text from the document and puts it into your source document.
4. You can use the security features of the Elastic Stack. You need to activate a trial license or buy a Platinum license, or use cloud.elastic.co.
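
A minimal sketch of the BASE64 step described above, not the official demo. Python is used only for illustration; the same JSON bodies can be sent with any HTTP client. The function names `build_pipeline_body` and `build_document` are hypothetical, and the field name `data` is an assumption that simply has to match the `field` configured in the attachment processor.

```python
import base64
import json


def build_pipeline_body(source_field="data"):
    """Body for an ingest pipeline with a single attachment processor.

    `source_field` is the document field that holds the BASE64 string.
    """
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": source_field}}],
    }


def build_document(file_bytes, source_field="data"):
    """JSON document that carries the binary file as a BASE64 string."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {source_field: encoded}


# Placeholder bytes stand in for a real PDF read with open(path, "rb").
sample_bytes = b"placeholder bytes, not a real PDF"
print(json.dumps(build_pipeline_body(), indent=2))
print(json.dumps(build_document(sample_bytes)))
```

The first body would be sent when creating the pipeline, and the second when indexing a document with the `pipeline` request parameter set to that pipeline's name. The extracted text then appears under `attachment.content` in the stored document.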

Mani_Sundharram · January 13, 2019, 9:16am

Can you share documentation URL ?

dadoonet · January 13, 2019, 9:28am

https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

Mani_Sundharram · January 13, 2019, 1:29pm

Hi David,
Thank you for your quick responses. I would be grateful if you could clarify the doubts below.

C:\Users\Administrator.fscrawler\job1_settings.json
"elasticsearch" : {
  "index" : "jobindex1",
  "index_folder" : "jobfoldersindex1",
  "pipeline" : "fscrawler",
  "nodes" : [ {
    "url" : "http://127.0.0.1:9200"
  } ],
  "bulk_size" : 100,
  "flush_interval" : "5s",
  "byte_size" : "25mb"
}

The above are my FSCrawler settings for Elasticsearch.

Request to create a pipeline:
PUT _ingest/pipeline/fscrawler
{
  "description" : "fscrawler pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

Files are imported into elasticsearch successfully.

Request to get the files that contain the string below:
GET /jobindex1/_search
{
  "query" : {
    "match": {
      "content" : "emad"
    }
  }
}

Result:
    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 3.15744,
        "hits" : [
          {
            "_index" : "jobindex1",
            "_type" : "_doc",
            "_id" : "7bfc1ba6cb2ea96a7cea1b84f4dbd",
            "_score" : 3.15744,
            "_source" : {
              "path" : {
                "virtual" : "/Attendance_Appraisals_07022017120507.pdf",
                "root" : "8f384e4d1aa1e6127ed1195953dccce3",
                "real" : """D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "file" : {
                "extension" : "pdf",
                "last_accessed" : "2019-01-13T06:41:24.289+0000",
                "filename" : "Attendance_Appraisals_07022017120507.pdf",
                "content_type" : "application/pdf",
                "created" : "2019-01-13T06:41:24.289+0000",
                "indexing_date" : "2019-01-13T12:04:23.798+0000",
                "filesize" : 509153,
                "last_modified" : "2017-02-07T09:05:03.872+0000",
                "url" : """file://D:\PDf list\Attendance_Appraisals_07022017120507.pdf"""
              },
              "meta" : {
                "created" : "2017-02-07T02:59:52.000+0000",
                "format" : "application/pdf; version=1.3",
                "raw" : {
                  "pdf:PDFVersion" : "1.3",
                  "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
                  "xmp:CreatorTool" : "Canon ",
                  "access_permission:modify_annotations" : "true",
                  "access_permission:can_print_degraded" : "true",
                  "meta:creation-date" : "2017-02-07T06:59:52Z",
                  "created" : "2017-02-07T06:59:52Z",
                  "access_permission:extract_for_accessibility" : "true",
                  "access_permission:assemble_document" : "true",
                  "xmpTPg:NPages" : "1",
                  "Creation-Date" : "2017-02-07T06:59:52Z",
                  "resourceName" : "Attendance_Appraisals_07022017120507.pdf",
                  "dcterms:created" : "2017-02-07T06:59:52Z",
                  "dc:format" : "application/pdf; version=1.3",
                  "access_permission:extract_content" : "true",
                  "access_permission:can_print" : "true",
                  "pdf:docinfo:creator_tool" : "Canon ",
                  "access_permission:fill_in_form" : "true",
                  "pdf:encrypted" : "false",
                  "producer" : " ",
                  "access_permission:can_modify" : "true",
                  "pdf:docinfo:producer" : " ",
                  "pdf:docinfo:created" : "2017-02-07T06:59:52Z",
                  "Content-Type" : "application/pdf"
                },
                "creator_tool" : "Canon "
              },
              "foo" : "bar",
              "content" : """

    A.
    NPtr

    INTER OFFICE MEMO

    Dear All Employees

    ln reference to the above mentioned subject, irrespective of numerous correspondences, it
    has been noticed that many employees are still reporting to work late on many occasions.
    The grace period for morning Punch lN time is only 15 minutes from the official start timing
    irrespective of Head Office or Sites. Late attendance will be deducted from the monthly
    salary. Also PUNCH IN/OUT is mandatory. The Missed Punching will also be considered as
    Absent.

    Also note that the late Punching and related deduction will be affecting the Performance
    Appraisal of the employees.

    ln view of all the above all staff are requested to do proper attendance punching and if any
    technical issue please coordinate with the IT/HR department to rectify the same at the
    earliest will be given to any staff on the attendance punching

    E Janabi
    HR & Admin Manager

    Ref No. Trojan/lOM/HR & ADM/44581 17 Date: 07to2t2017 Pages 1

    To All Staff TROJAN & NPC

    From Emad AI Janabi HR & Admin Manager

    CC: Engr. Hamad Al Ameri Managing Director

    Subject Attendance Regulations & Performance Appraisals

    P.o. Box 111059, Abu Dhabi, uAE. Tel. no. +9t1 2 so973oo - Fax: +gl1 2 5g2gs94

    .i - oi

    I'l{( )f n N



    """
            }
          }
        ]
      }
    }

Why am I not able to get a result like this?
{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

dadoonet · January 13, 2019, 1:55pm

If you are using FSCrawler, you don't need the ingest attachment plugin at all, as everything is done by FSCrawler.

What is the output you need? You don't want to index the content?
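
To make the difference between the two outputs posted above concrete: in the FSCrawler result the extracted text sits in a top-level `content` field, while in the ingest attachment example it sits under `attachment.content`. The helper below is only an illustrative sketch; the name `extracted_text` is hypothetical and not part of either tool.

```python
def extracted_text(source):
    """Return the extracted text from a document's _source.

    Handles both shapes seen in this thread:
    - ingest attachment: {"data": ..., "attachment": {"content": ...}}
    - FSCrawler:         {"content": ..., "file": {...}, "path": {...}}
    Returns None when neither field is present.
    """
    if "attachment" in source:
        return source["attachment"].get("content")
    return source.get("content")


# Trimmed-down versions of the two _source documents from the thread.
ingest_style = {"attachment": {"content": "Lorem ipsum dolor sit amet"}}
fscrawler_style = {"foo": "bar", "content": "INTER OFFICE MEMO"}
print(extracted_text(ingest_style))
print(extracted_text(fscrawler_style))
```

In other words, a `match` query on `content` is the right one for documents indexed by FSCrawler, whereas documents indexed through the attachment processor would be queried on `attachment.content`.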

Mani_Sundharram · January 13, 2019, 2:04pm

Good to know that everything is done by FSCrawler.
I need to search inside the content of more than 20000 PDF files and return the list of matching files to the user. Is this an efficient way to do so, or would the ingest plugin be more efficient?
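
The requirement above (return only the list of matching files) could be sketched as follows, assuming the FSCrawler document layout shown in the earlier search result. The function names `build_content_query` and `matching_files` and the default `size` of 100 are assumptions for illustration. The `_source` filter asks Elasticsearch to return only the file name and path instead of the full extracted text.

```python
def build_content_query(text, size=100):
    """Search body: full-text match on `content`, returning only file info."""
    return {
        "size": size,
        "_source": ["file.filename", "path.real"],
        "query": {"match": {"content": text}},
    }


def matching_files(response):
    """Collect the file names from a search response."""
    return [hit["_source"]["file"]["filename"] for hit in response["hits"]["hits"]]


# A response trimmed down from the one posted in this thread.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "jobindex1",
                "_source": {
                    "file": {"filename": "Attendance_Appraisals_07022017120507.pdf"}
                },
            }
        ],
    }
}
print(build_content_query("emad"))
print(matching_files(sample_response))
```

The body from `build_content_query` would be sent to `GET /jobindex1/_search`, the same endpoint used earlier in the thread.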

dadoonet · January 13, 2019, 2:24pm

It depends.

If you need to crawl a filesystem, then FSCrawler is good. If you just need to index one binary file that you have somewhere, then ingest attachment is probably OK.
But it doesn't expose all Tika features, such as OCR. If you need those, FSCrawler would be preferred.

Disclaimer: I'm the author of FSCrawler so I might be biased.

Mani_Sundharram · January 14, 2019, 3:48am

Thank you, David. I have successfully implemented the search.

system · February 11, 2019, 3:48am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
