How to index files?

I have added some pages to App Search. According to the documentation, the Crawler does not parse documents (PDF/DOCX) linked from a page.

  1. Are the links to these documents stored anywhere? (I couldn't find them.)
  2. If so, can parsing these files be automated with the ingest-attachment plugin? (See the sketches after this list.)
  3. If not, is there a solution that would help me here, or do I have to write an external script that collects all the documents, converts them to Base64, and passes them through the API for indexing?
  4. Can the data be sent to a specific Elasticsearch index when using the aforementioned plugin?
  5. Is the aforementioned plugin able to handle duplicate files?
  6. If I store the file data in a chosen index, how can I combine it with the crawler data so that the resulting search engine uses both?
  7. Is there a way to identify the source (the content of a page vs. a file), for example by configuring additional search fields for Search UI?
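For questions 2 and 4, this is roughly what I had in mind: a minimal sketch, assuming a local Elasticsearch at `localhost:9200` with security disabled; the pipeline id `attachments` is just a name I picked.

```python
# Minimal sketch, assuming a local unsecured Elasticsearch on localhost:9200.
# The pipeline id "attachments" is a placeholder.
import requests

ES = "http://localhost:9200"

pipeline = {
    "description": "Extract text from base64-encoded files",
    "processors": [
        # The attachment processor (ingest-attachment plugin) reads the
        # base64 payload from "data" and writes the extracted text and
        # metadata to the "attachment" object field.
        {"attachment": {"field": "data", "indexed_chars": -1}},
        # Drop the raw base64 once the text has been extracted.
        {"remove": {"field": "data"}},
    ],
}

resp = requests.put(f"{ES}/_ingest/pipeline/attachments", json=pipeline)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```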
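And a sketch of the external-script fallback from question 3, again with made-up names (the `file-attachments` index and the `./downloads` folder). Using the SHA-256 of the file content as the `_id` is my own idea for the duplicate handling in question 5, since as far as I know the plugin does no deduplication by itself.

```python
# Hedged sketch: walk a local folder of downloaded documents, base64-encode
# each file, and index it through the "attachments" pipeline created above.
# Index name, folder, and the "source_url" field are placeholders.
import base64
import hashlib
import pathlib

import requests

ES = "http://localhost:9200"
INDEX = "file-attachments"    # placeholder: the target Elasticsearch index
PIPELINE = "attachments"      # the pipeline from the previous sketch
DOCS_DIR = pathlib.Path("./downloads")

for path in DOCS_DIR.glob("**/*"):
    if not path.is_file() or path.suffix.lower() not in {".pdf", ".docx"}:
        continue
    raw = path.read_bytes()
    # Same content -> same _id, so re-indexing a file updates instead of
    # duplicating it.
    doc_id = hashlib.sha256(raw).hexdigest()
    doc = {
        "data": base64.b64encode(raw).decode("ascii"),
        "source_url": str(path),  # replace with the URL the file came from
    }
    resp = requests.put(
        f"{ES}/{INDEX}/_doc/{doc_id}",
        params={"pipeline": PIPELINE},
        json=doc,
    )
    resp.raise_for_status()
    print(doc_id, path.name, resp.json()["result"])  # "created" or "updated"
```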

I would like to solve this without using FSCrawler.

Every URL found during a crawl has a corresponding event logged, together with a decision on whether it was allowed or denied. You can read more about those logs here.