Can I upload a PDF file directly instead of Base64 data?

Hi,
I would like to upload a PDF file directly instead of converting the content inside the PDF to Base64, as below.

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [ "CONTENT", "TITLE", "DATE", "CONTENT_TYPE" ]
      }
    }
  ]
}

PUT cvattachment/_doc/1?pipeline=attachment
{
  "data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8"
}

(The Base64 data is very big, so I pasted only the beginning of the converted string.)
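For context, here is roughly how that Base64 step can be scripted. This is my own sketch, not part of the original question: it assumes the Python requests library, a local node on port 9200, and a hypothetical file name cv.pdf.

import base64
import requests

# Sketch only: encode a local PDF to Base64 and index it through the
# "attachment" pipeline defined above. "cv.pdf" is a hypothetical file name.
with open("cv.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

resp = requests.put(
    "http://localhost:9200/cvattachment/_doc/1",
    params={"pipeline": "attachment"},
    json={"data": encoded},
)
print(resp.json())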

Literally, I just want to replace the Base64 string with a file path, like below.

"data" : "C:/xxx/xxx.pdf"

The ingest attachment processor cannot do that, specifically because you can't really know where the code will be running (on which node), so it won't have access to C:/xxx/xxx.pdf.

You can have a look at the FSCrawler project, which does a similar thing though.

I am able to place it in the root directory of the Elasticsearch server.
For example, in my case the local server is running from the location below:
C:\elasticsearch-6.6.1\bin\elasticsearch.bat
So I can place the documents inside any of its directories.
Is it possible to upload the PDF document in that scenario?

As I said, it's not. You are thinking here of a single-node cluster, but Elasticsearch is a multi-node cluster system by design.

Node X can't access a directory which is on Node Y running on another machine. That's why this feature is not supported at all.


Thanks a lot @dadoonet for your suggestion to use FSCrawler. It really helped me achieve the POC.

I would like to share the steps I followed for the POC, so that it will be helpful to anyone exploring Elasticsearch as a beginner.

Steps used to achieve the POC:

Title:

Search the CVs (PDF or Word files residing in OneDrive or locally) and search for anything in their content using Kibana, for example the location worked at or the previous company, etc.

Prerequisites:

  1. Install JDK 1.8
  2. Set the JAVA_HOME path

Steps:

https://www.elastic.co/downloads - to download the servers (Elasticsearch and Kibana)

  1. Start the Elasticsearch server
  2. Start the Kibana server
  3. Verify the servers are started using the link below (a scripted check is sketched after this list)
    Link - http://localhost:5601
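As a quick sanity check, you can also probe both servers from a script. This is my own addition, assuming the default ports 9200 (Elasticsearch) and 5601 (Kibana) and the Python requests library:

import requests

# Elasticsearch answers on 9200 with cluster info
print(requests.get("http://localhost:9200").json())
# Kibana's status endpoint returns HTTP 200 once it is up
print(requests.get("http://localhost:5601/api/status").status_code)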

Start FSCrawler

https://fscrawler.readthedocs.io/en/latest/installation.html - to download FSCrawler

  1. Open a command prompt, navigate to the fscrawler folder, then type: .\bin\fscrawler job1
  2. It will ask "Do you want to create it (Y/N)?" - type "Y"
  3. Now we have to change the configuration to point to the folder holding the files (a sample settings file is sketched after this list)
    For example, navigate to the file "C:\Users\jesumanij\.fscrawler\job1\_settings.yaml" and edit the below:
    Old: url: "/tmp/es"
    New: url: "C:\Users\jesumanij\CV" (don't use the desktop)
    Make sure the above folder exists and place all the files (in our case, all the CVs) inside that location
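For orientation, the edited _settings.yaml would look roughly like the sketch below. This is an assumption based on the FSCrawler documentation rather than a copy of my file, and the exact field names can vary between FSCrawler versions:

name: "job1"
fs:
  # Backslashes are escaped for YAML; forward slashes also work on Windows
  url: "C:\\Users\\jesumanij\\CV"
  # Default rescan interval; this is the 15 minutes mentioned later
  update_rate: "15m"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"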

Now start FSCrawler again with the same command:
.\bin\fscrawler job1

Create Index pattern:

Kibana -> Management -> Index Patterns -> Create index pattern -> type "job1" (the same job name we used while starting FSCrawler) in the index pattern input -> click Next -> choose "file.created" as the time field and click "Create index pattern"

Search for the CVs:

Kibana -> Discover
Click the drop-down on the left and choose "job1".
Make sure the time picker at the top right is set to something like "Year to date" so that all documents since the beginning are shown.
Then we can add the available fields below from the left-hand side based on the requirement (a raw query equivalent is sketched after this list):
content, file.filename, file.extension, file.url, file.filesize, etc.
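The same search can also be run against the index directly over the REST API, which is what Kibana Discover does under the hood. A hedged sketch of my own, assuming the Python requests library and a match query on the "content" field that FSCrawler populates:

import requests

# Full-text search over the extracted CV content; return only file metadata
query = {
    "query": {"match": {"content": "previous company"}},
    "_source": ["file.filename", "file.extension", "file.url"],
}
resp = requests.post("http://localhost:9200/job1/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["file"]["filename"])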

Refresh the files in the folder to be available for search:

  1. Add the new files to the location (in our case it's "C:\Users\jesumanij\CV")
  2. FSCrawler picks them up on its next scan; by default it rescans every 15 minutes
  3. After 15 minutes, we can click the refresh button in Kibana and check whether the new files are available for search

Awesome! Would you like to contribute this as part of the FSCrawler documentation? Like a tutorial?

Sure @dadoonet
