Hi everyone, when it comes to PDF indexing, is the proper way to do this to use an ingest pipeline with an attachment?
So here is the scenario: we have a number of PDFs that range in size from under a megabyte to over 50 MB. My understanding is that you must base64-encode each document and add it as an attachment to your index.
Question being, is this the correct way or is there a better option for indexing PDF documents?
Right now, I have a sample Node project that basically base64-encodes the docs and then adds them to the index. While this seems to work, it can be a slow process at scale, hence the question about whether it's the proper/right way to accomplish this.
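For reference, here is roughly what the sample project does, stripped down to a sketch. I'm assuming the 8.x @elastic/elasticsearch client; the pipeline name (pdf-attachment), index name (pdf-docs), and field names are just placeholders I picked:

```typescript
import { Client } from '@elastic/elasticsearch';
import { readFile } from 'node:fs/promises';

const client = new Client({ node: 'https://localhost:9200' }); // plus auth in practice

async function run(path: string) {
  // One-time setup: an ingest pipeline that runs the attachment processor
  // over the base64 payload stored in the "data" field.
  await client.ingest.putPipeline({
    id: 'pdf-attachment',
    description: 'Extract text from base64-encoded PDFs',
    processors: [{ attachment: { field: 'data', indexed_chars: -1 } }],
  });

  // Base64-encode the PDF and send it through the pipeline.
  const data = (await readFile(path)).toString('base64');
  await client.index({
    index: 'pdf-docs',
    pipeline: 'pdf-attachment',
    document: { data },
  });
}

run('./sample.pdf').catch(console.error);
```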
The next question that comes to mind is: will this account for duplicate documents? What I mean is that every time I run the script, it just adds documents rather than updating them. So what would be the right way to add a document if it's new but update it if it's already in the index? Further to that, if the document is gone, it should also be removed from the index, if that makes sense.
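In case it helps to see what I mean, here is what I imagine the "create if new, overwrite if existing, delete if gone" behaviour would look like. Deriving a stable _id from the file path is just my guess at the usual approach; index and pipeline names are the illustrative ones from above:

```typescript
import { Client } from '@elastic/elasticsearch';
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

const client = new Client({ node: 'https://localhost:9200' }); // plus auth in practice

// A stable document id derived from the file path (a content hash would also work).
const docId = (path: string) => createHash('sha1').update(path).digest('hex');

async function upsertPdf(path: string) {
  const data = (await readFile(path)).toString('base64');
  // index() with an explicit id creates the document or overwrites it in place,
  // so re-running the script should not pile up duplicates.
  await client.index({
    index: 'pdf-docs',
    id: docId(path),
    pipeline: 'pdf-attachment',
    document: { data },
  });
}

async function removePdf(path: string) {
  // Call this for paths that existed in a previous run but are now gone on disk.
  await client.delete({ index: 'pdf-docs', id: docId(path) });
}
```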
I apologize for the dumb question; I'm super new to ES. When you say encode the file path, would this just be part of the body?
Right now I'm just using client.index and passing over the data with some metadata, so no actual path to the file per se... (I'm probably totally missing what you are saying here.)
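So if I understand the suggestion, the path would just be another field in the body I already pass to client.index, something like this (field names are my own, and it's the same illustrative pipeline as in my earlier sketch):

```typescript
import { Client } from '@elastic/elasticsearch';
import { readFile } from 'node:fs/promises';

const client = new Client({ node: 'https://localhost:9200' }); // plus auth in practice

async function indexOne(path: string) {
  const data = (await readFile(path)).toString('base64');
  await client.index({
    index: 'pdf-docs',
    pipeline: 'pdf-attachment',   // same attachment pipeline as before
    document: {
      data,                       // base64-encoded PDF contents
      file_path: path,            // the path, kept as a plain metadata field
      department: 'finance',      // any other metadata travels the same way
    },
  });
}

indexOne('./sample.pdf').catch(console.error);
```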
On the FSCrawler side, I was looking at that and it looks like its REST API may work: we would upload the binary instead of base64 and then push to the index (assuming I'm reading that right), and ES would index that as the attachment rather than our passing a base64 stream. I'm assuming they are essentially doing the same thing, but it would be faster with FSCrawler because we would not need to convert to base64 first.
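If I read the FSCrawler docs right, its REST service accepts the raw file as a multipart upload, does the extraction itself, and writes the result to Elasticsearch. A rough sketch of what the client side might look like; the endpoint path, default port, and form field name come from my reading of the docs, so please double-check them for the version you run (assumes Node 18+ for global fetch/FormData):

```typescript
import { readFile } from 'node:fs/promises';

async function uploadToFscrawler(path: string) {
  // Post the raw binary as multipart form data; no base64 step needed.
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)]), path);

  const res = await fetch('http://127.0.0.1:8080/fscrawler/_upload', {
    method: 'POST',
    body: form,
  });
  console.log(await res.json()); // FSCrawler responds with a JSON acknowledgement
}

uploadToFscrawler('./sample.pdf').catch(console.error);
```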
FSCrawler is definitely a good option, though I believe it outputs to Elasticsearch and Workplace Search, but not App Search (@dadoonet correct me if I'm wrong here!). So if you're tied to App Search, that may not work for you.
What version are you using, and are you using the Web Crawler (PDFs hosted on your site) or are you indexing via API? If you're using the web crawler, I'm excited to inform you that binary content extraction was added in 8.3.0, which was recently released. See: Web crawler reference | Elastic App Search Documentation [8.3] | Elastic
If you're not using the web crawler, there's not currently a way to use Ingest Pipelines through the App Search Documents API. However, App Search did recently add the capability to search Elasticsearch Indexes. So you could index directly to Elasticsearch (using FSCrawler or Elasticsearch APIs + Ingest Pipeline). See: Elasticsearch index engines (technical preview) | Elastic App Search Documentation [8.3] | Elastic (note that this feature is a Technical Preview).
Stay tuned to future releases, as my team is actively working on democratizing Binary Content Extraction across the Enterprise Search product. If you have a support relationship with Elastic, I'd encourage you to file an Enhancement Request with us, so that your use case can be top of mind as we prioritize where to add these features next.
Right now we are on a trial, as we are looking to replace our current search tool (IDOL), so this is all new to me.
Ultimately, I would love to be able to use the regular web crawler; however, I was not able to get that working for PDFs. It only indexed the page, not the PDF itself (it would make my life simpler if I didn't have to create a custom indexer/pipeline).
If the crawler can do this, I would totally prefer that method. I guess the question would be: will it also index other metadata associated with the file? For example, right now we index against a JSON file that contains a link to the file but also things like department name, date, etc.
The goal is to be able to do a search that searches the PDF content and also be able to narrow it down to, say, a department between date A and date B.
If the crawler can do that (please say yes), that would be great... and if the answer is yes, how?
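For what it's worth, here is roughly the query I'm hoping to end up with if we go the Elasticsearch + attachment pipeline route from my earlier sketches. attachment.content is where that processor puts the extracted text; department and published_date are illustrative fields we would have to supply ourselves:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // plus auth in practice

async function searchPdfs() {
  const result = await client.search({
    index: 'pdf-docs',
    query: {
      bool: {
        // Full-text search over the text the attachment processor extracted.
        must: [{ match: { 'attachment.content': 'budget forecast' } }],
        filter: [
          { term: { department: 'finance' } }, // assumes department is mapped as keyword
          { range: { published_date: { gte: '2022-01-01', lte: '2022-06-30' } } },
        ],
      },
    },
  });
  console.log(result.hits.hits);
}

searchPdfs().catch(console.error);
```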
will it also index other metadata associated with the file, for example, right now, we index against a JSON file that contains a link to the file but also things like department name, date etc.
No. I'm not sure where you're getting "department name" for a PDF, but that's not something that App Search will be able to intuit. It will grab metadata like the content type (pdf), title, URL, last-crawled-date, and additional URLs. If the URL implies the department, via path, that might be something you can infer via the path segments that we index as well. But if you need totally custom metadata that cannot be inferred, then you probably need to do that yourself.
I have not had a chance to dig back into this due to other projects pulling me away, so I do apologise. I'm open to chatting directly; I'm not sure of the best way to do that...
The basic use case is
Right now, we use another search tool. This has become problematic because of limited support and few knowledgeable people who can administer it. We are looking to migrate to a better-supported tool; however, we do have some requirements, one of them being that we must be able to perform document searches in addition to a regular web search.
As noted, our current flow is:
1) Documents are ingested into a document repository.
2) Once processed, a JSON file is created that contains all the documents for a specific collection. This JSON file contains things like:
[{
  "document_title": "Some title",
  "department": "department",
  "document_url": "https://use.to.document"
}]
There are many more elements but generally it's just a collection of properties associated with the document
3) When the crawler runs, it runs against that JSON file; each JSON element is indexed by the crawler, and the URL portion is used as the document attachment, with the contents of the document also indexed.
Once the collection is created, we have a search interface that allows us to perform a regular 'text' search that searches the contents of the document (usually a PDF). We have additional filters as well, where a user can select department, date, and other sub-categories that may be listed in the JSON file.
We are not tied to it being a JSON file, so if ES can perform the same type of ingesting just with a different flow, I think that's doable, even using a pipeline if necessary as long as the desired results remain the same.
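To make that concrete, here is the kind of flow I have in mind on the Elasticsearch side, reusing the illustrative pipeline/index names from my earlier sketches: read the collection JSON, fetch each document_url, and index the metadata and the base64 body together. Assumes Node 18+ for fetch and the 8.x @elastic/elasticsearch client:

```typescript
import { Client } from '@elastic/elasticsearch';
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

const client = new Client({ node: 'https://localhost:9200' }); // plus auth in practice

interface ManifestEntry {
  document_title: string;
  department: string;
  document_url: string;
  // ...any other properties in the collection
}

async function indexCollection(manifestPath: string) {
  const entries: ManifestEntry[] = JSON.parse(await readFile(manifestPath, 'utf8'));

  for (const entry of entries) {
    // Download the PDF referenced by the manifest entry.
    const pdf = Buffer.from(await (await fetch(entry.document_url)).arrayBuffer());

    await client.index({
      index: 'pdf-docs',
      // Stable id derived from the URL, so re-runs update rather than duplicate.
      id: createHash('sha1').update(entry.document_url).digest('hex'),
      pipeline: 'pdf-attachment',
      // Metadata from the JSON entry plus the base64 body in one document.
      document: { ...entry, data: pdf.toString('base64') },
    });
  }
}

indexCollection('./collection.json').catch(console.error);
```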
Anyhow, let me know how we can connect and I can go into detail a bit more and see if ES may be a viable solution.
Hey @MK817664, it would definitely be feasible to do that workflow, and it would give you additional tools on top of what you're using to improve and tweak the search experience. I'll send you a direct message to find some time to connect.