Can I create an index for a PDF file?
How?
You'll want the ingest-attachment plugin.
Have a look at "Implementing Ingest Attachment Processor Plugin" for an example; I think it will help you.
If you have PDF docs in a folder, you can use the FSCrawler project as well.
Thanks, both of you.
Can I give the path of a file to the ingest plugin?
No.
You need to push data to ingest. Ingest cannot pull data from a source.
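To make "push" concrete: the client reads the file itself, base64-encodes the bytes, and sends that encoded content in the field the attachment processor is configured to read. Here is a minimal Python sketch of building such a request body. The function name build_attachment_doc and the default field name "data" are my own assumptions for illustration; use whatever field your pipeline names.

```python
import base64
import json


def build_attachment_doc(file_bytes: bytes, field: str = "data") -> str:
    """Return a JSON body whose `field` holds the base64-encoded file content.

    `field` is an assumed name; it must match the "field" setting of the
    attachment processor in your pipeline.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps({field: encoded})


# In-memory stand-in for a real file; in practice you would read the bytes
# yourself, e.g. open(pdf_path, "rb").read(), where pdf_path is your file.
body = build_attachment_doc(b"hello")
print(body)
```

The resulting body is what you send when indexing with ?pipeline=<your pipeline name>. The file path never reaches Elasticsearch, only the encoded content does.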
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "path"
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "path": { "type": "{path.home}/data/base64.bin" }
}
Can I use a path like this?
Is there any other plugin I can use with ingest?
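No, you can't. The attachment processor expects the field to hold the base64-encoded content of the file, not a path to it.
But I already mentioned the FSCrawler project.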
FSCrawler is not working, and it is not even showing an error.
The command fscrawler job_name has been stuck for an hour on the CLI.
I am using Windows 10 and FSCrawler 2.2.
Please help me.
Yes. It's because of this:
Use the latest SNAPSHOT from here: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/
Thanks, now I am able to create an index in Elasticsearch.
Can I also index images that are inside a PDF?
Do you mean OCR?
If so, this (https://github.com/dadoonet/fscrawler#ocr-integration-using-tika-and-tesseract) is supposed to work. But someone reported that it might be buggy actually:
I installed Tesseract, but when I execute any exe inside it, it does not run.
I really don't know. I'm sorry. I need to find some spare time to play with it, but I haven't had a chance to find a couple of hours yet.
Thanks for helping me