Can I upload a PDF file directly instead of Base64 data?

Hi,
I would like to upload a PDF file directly instead of converting the content inside the PDF to Base64, as below.

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [ "CONTENT", "TITLE", "DATE", "CONTENT_TYPE" ]
      }
    }
  ]
}

PUT cvattachment/_doc/1?pipeline=attachment
{
  "data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8"
}

(The Base64 data is very big, so I pasted only the beginning of the converted string.)
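For context, here is roughly how that Base64 step can be scripted. This is my own sketch, not part of the original question: it assumes the Python requests library, a local node on port 9200, and a hypothetical file name cv.pdf.

import base64
import requests

# Sketch only: encode a local PDF to Base64 and index it through the
# "attachment" pipeline defined above. "cv.pdf" is a hypothetical file name.
with open("cv.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

resp = requests.put(
    "http://localhost:9200/cvattachment/_doc/1",
    params={"pipeline": "attachment"},
    json={"data": encoded},
)
print(resp.json())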

Literally, I just want to replace the Base64 string with a file path, like below.

"data" : "C:/xxx/xxx.pdf"

The ingest attachment processor cannot do that, specifically because you can't really know where the code will be running (on which node), so it won't have access to C:/xxx/xxx.pdf.

You can have a look at the FSCrawler project, which does a similar thing though.

I am able to place it in the root directory of the Elasticsearch server.
For example, in my case the local server is running from the location below:
C:\elasticsearch-6.6.1\bin\elasticsearch.bat
So I can place the documents inside any of its directories.
Is it possible to upload the PDF document in that scenario?

As I said, it's not. You are thinking here of a single-node cluster, but Elasticsearch is a multi-node cluster system by design.

Node X can't access a directory which is on Node Y running on another machine. That's why this feature is not supported at all.


Thanks a lot @dadoonet for your suggestion to use FSCrawler. It really helped me achieve the POC.

I would like to share the steps I followed for the POC, so that it will be helpful to anyone exploring Elasticsearch as a beginner.

Steps used to achieve the POC:

Title:

Search the CVs (PDF or Word files residing in OneDrive or locally) and search for anything in their content using Kibana, for example the location worked at or the previous company, etc.

Prerequisites:

  1. Install JDK 1.8
  2. Set the JAVA_HOME path

Steps:

https://www.elastic.co/downloads - to download the servers (Elasticsearch and Kibana)

  1. Start the Elasticsearch server
  2. Start the Kibana server
  3. Verify the servers are started using the link below (a scripted check is sketched after this list)
    Link - http://localhost:5601
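As a quick sanity check, you can also probe both servers from a script. This is my own addition, assuming the default ports 9200 (Elasticsearch) and 5601 (Kibana) and the Python requests library:

import requests

# Elasticsearch answers on 9200 with cluster info
print(requests.get("http://localhost:9200").json())
# Kibana's status endpoint returns HTTP 200 once it is up
print(requests.get("http://localhost:5601/api/status").status_code)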

Start FSCrawler

https://fscrawler.readthedocs.io/en/latest/installation.html - to download FSCrawler

  1. Open a command prompt, navigate to the fscrawler folder, then type: .\bin\fscrawler job1
  2. It will ask "Do you want to create it (Y/N)?" - type "Y"
  3. Now we have to change the configuration to point to the folder holding the files (a sample settings file is sketched after this list)
    For example, navigate to the file "C:\Users\jesumanij\.fscrawler\job1\_settings.yaml" and edit the below:
    Old: url: "/tmp/es"
    New: url: "C:\Users\jesumanij\CV" (don't use the desktop)
    Make sure the above folder exists and place all the files (in our case, all the CVs) inside that location
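For orientation, the edited _settings.yaml would look roughly like the sketch below. This is an assumption based on the FSCrawler documentation rather than a copy of my file, and the exact field names can vary between FSCrawler versions:

name: "job1"
fs:
  # Backslashes are escaped for YAML; forward slashes also work on Windows
  url: "C:\\Users\\jesumanij\\CV"
  # Default rescan interval; this is the 15 minutes mentioned later
  update_rate: "15m"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"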

Now start FSCrawler again with the same command:
.\bin\fscrawler job1

Create Index pattern:

Kibana -> Management -> Index Patterns -> Create index pattern -> type "job1" (the same job name we used while starting FSCrawler) in the index pattern input -> click Next -> choose "file.created" as the time field and click "Create index pattern"

Search for the CVs:

Kibana -> Discover
Click the drop-down on the left and choose "job1".
Make sure the time picker at the top right is set to something like "Year to date" so that all documents since the beginning are shown.
Then we can add the available fields below from the left-hand side based on the requirement (a raw query equivalent is sketched after this list):
content, file.filename, file.extension, file.url, file.filesize, etc.
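The same search can also be run against the index directly over the REST API, which is what Kibana Discover does under the hood. A hedged sketch of my own, assuming the Python requests library and a match query on the "content" field that FSCrawler populates:

import requests

# Full-text search over the extracted CV content; return only file metadata
query = {
    "query": {"match": {"content": "previous company"}},
    "_source": ["file.filename", "file.extension", "file.url"],
}
resp = requests.post("http://localhost:9200/job1/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["file"]["filename"])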

Refresh the files in the folder to be available for search:

  1. Add the new files to the location (in our case it's "C:\Users\jesumanij\CV")
  2. FSCrawler picks them up on its next scan; by default it rescans every 15 minutes
  3. After 15 minutes, we can click the refresh button in Kibana and check whether the new files are available for search

Awesome! Would you like to contribute this as part of the FSCrawler documentation? Like a tutorial?

Sure @dadoonet
