Elasticsearch: Best approach to index files

Hello Team

I have a few files (PDF and DOCX) that contain questions and answers (think FAQs). The total size of all the files is around 500 MB.

Expected output: when we search for something, it should search across all the documents and return the relevant answer.

What is the best way to index these files?

1. Index page by page using the ingest attachment processor. I think we would need to maintain a parent-child relation. I'm afraid that when we search with a match query it will return the whole page, and we would need to parse it after getting the response. Also, if a question is on one page and its answer is on another page, I'm not sure how this would work.

2. Extract the questions and answers from the files, convert them to JSON, and index that. That is, extract each file to text, convert the text to JSON with question and answer as keys, and index it using the Elasticsearch client. With many files, I'm not sure how long it will take to convert them all to text and then to JSON. I think this approach is more suitable for the current scenario, but I'm not sure. Please suggest.
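
To make approach 2 concrete, here is a minimal sketch of the "text to JSON" step. It assumes the text has already been extracted from the PDF/DOCX files, and that every question starts with "Q:" and every answer with "A:". That marker format, the index name "faq", the "source" field, and both function names are assumptions for illustration only; the real files may need a different pattern.

```python
import json
import re

# ASSUMPTION: questions are prefixed with "Q:" and answers with "A:".
# Adjust this pattern to whatever layout the real files use.
QA_PATTERN = re.compile(
    r"Q:\s*(?P<question>.+?)\s*A:\s*(?P<answer>.+?)(?=\s*Q:|\Z)",
    re.DOTALL,
)


def extract_qa_pairs(text, source):
    """Split already-extracted text into question/answer dicts."""
    pairs = []
    for match in QA_PATTERN.finditer(text):
        pairs.append({
            # Collapse line breaks and repeated spaces left over from extraction.
            "question": " ".join(match.group("question").split()),
            "answer": " ".join(match.group("answer").split()),
            "source": source,  # hypothetical field: the originating file name
        })
    return pairs


def to_bulk_actions(pairs, index_name="faq"):
    """Wrap each pair as an action dict (index name is hypothetical)."""
    for pair in pairs:
        yield {"_index": index_name, "_source": pair}


# Hypothetical sample standing in for text extracted from one file.
sample = """Q: How do I reset my password?
A: Open Settings and choose
Reset password.
Q: Where are logs stored?
A: Under /var/log/app."""

pairs = extract_qa_pairs(sample, source="faq.pdf")
actions = list(to_bulk_actions(pairs))
print(json.dumps(actions, indent=2))
```

The action dicts are in the shape that `elasticsearch.helpers.bulk(client, actions)` from the Python client accepts, so indexing would be one call once a client is configured. Because the pairs are cut from the text of the whole file rather than from single pages, a question on one page and its answer on the next still end up in the same document, provided the extracted text is concatenated in order.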

Is there any other method that I should consider?

Thanks for your time, as always.

Best
Rahul
