I have a few files (PDF and DOCX) that contain questions and answers (think FAQs). The total size of all the files is around 500 MB.
Expected output: when we search for something, it searches across all the documents and returns the relevant answer.
What is the best way to index these files?
1. Index page by page using the ingest attachment processor. I think we would need to maintain a parent-child relation. I'm afraid that when we search using a match query, it will return the whole page, and we will have to parse it after getting the response. Also, if a question is on one page and its answer is on another page, I'm not sure how this would work.
2. Extract the questions and answers from the files, convert them to JSON, and index them. That is, extract each file to text, convert the text to JSON with question and answer as keys, and index it using the Elasticsearch client. With many files, I'm not sure how long it will take to convert all of them to text and then to JSON. I think this approach is more suitable for the current scenario, but I'm not sure. Please suggest.
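To make option 2 concrete, here is a minimal sketch of what I have in mind. It assumes the text has already been extracted from each file, and that every question starts with "Q:" and every answer with "A:" (my real files may need a different pattern). The names `extract_qa_pairs`, `build_bulk_actions`, `build_search_body`, the index name `faq`, and the field `source_file` are all made up for illustration. Because the parsing runs on the whole extracted text rather than page by page, an answer that continues on the next page should stay attached to its question.

```python
import re

# Assumption: each question starts with "Q:" and each answer with "A:".
# Real files will likely need a different pattern.
QA_PATTERN = re.compile(
    r"Q:\s*(?P<question>.+?)\s*A:\s*(?P<answer>.+?)(?=\s*Q:|\Z)",
    re.DOTALL,
)


def extract_qa_pairs(text):
    """Split extracted text into a list of {"question", "answer"} dicts."""
    pairs = []
    for match in QA_PATTERN.finditer(text):
        pairs.append({
            # Collapse whitespace, including page breaks (form feeds).
            "question": " ".join(match.group("question").split()),
            "answer": " ".join(match.group("answer").split()),
        })
    return pairs


def build_bulk_actions(pairs, index_name, source_file):
    """Yield one action dict per pair, in the shape the Python client's
    bulk helper accepts (an "_index" key plus a "_source" document)."""
    for pair in pairs:
        yield {
            "_index": index_name,
            "_source": {**pair, "source_file": source_file},
        }


def build_search_body(user_query):
    """Search both fields, weighting matches in the question higher."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": ["question^2", "answer"],
            }
        }
    }


if __name__ == "__main__":
    # "\f" stands in for a page break inside an answer.
    sample = (
        "Q: How do I reset my password?\n"
        "A: Open Settings and\fchoose Reset.\n"
        "Q: Is there a free tier?\n"
        "A: Yes.\n"
    )
    pairs = extract_qa_pairs(sample)
    for action in build_bulk_actions(pairs, "faq", "sample.pdf"):
        print(action)
    print(build_search_body("reset password"))
```

The generated actions could then be passed to `elasticsearch.helpers.bulk(client, actions)`, and each hit would carry only one question and its answer, so there would be no page to parse after the search.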
Is there any other method that I should consider?
Thanks for your time, as always.