If you have 5,000 PDF documents and want to find every instance of the word "dog" across all of them, including the context where it occurs (page number, the lines before and after the match, etc.), you could use a grep-like utility such as pdfgrep. However, this is relatively slow because it scans every file on each search instead of using an index.
Elasticsearch works at the document level: it returns matching documents rather than the individual matches within each document. The use case is just different. It looks like it might be possible to have FSCrawler index something like paragraphs or sentences as nested objects of the document and then return the hits within those nested objects across the entire index.
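To make the nested-object idea concrete, here is a rough sketch of the mapping and query bodies I have in mind, written as Python dicts. The field names ("lines", "page", "line_no", "text") are made up for illustration, and I don't know whether FSCrawler can produce documents in this shape; that part is an assumption. The nested field type, the nested query, and inner_hits are standard Elasticsearch features.

```python
import json

# Hypothetical mapping: each PDF is one document, and every line of text
# is stored as a nested object with its page and line number.
mapping = {
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "lines": {
                "type": "nested",
                "properties": {
                    "page": {"type": "integer"},
                    "line_no": {"type": "integer"},
                    "text": {"type": "text"},
                },
            },
        }
    }
}

# Nested query: match the word inside the nested lines and ask for the
# matching nested objects themselves via inner_hits.
query = {
    "query": {
        "nested": {
            "path": "lines",
            "query": {"match": {"lines.text": "dog"}},
            # inner_hits returns only the top 3 per document by default,
            # so the size is raised here.
            "inner_hits": {"size": 100},
        }
    }
}

print(json.dumps(query, indent=2))
```

One thing that worries me about this approach is that inner_hits is capped per document, so a document with more matching lines than the cap would not return all of them in a single request.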
- Is that really feasible?
- Would that actually be faster than just using grep? (I don't see how it couldn't be when dealing with thousands of documents, but I wonder.)
- Is there some other obvious solution I'm missing?
Basically, what I'm after is grep-like results (every match, with its context) combined with the speed advantages of a full-text index.
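As a toy illustration of what I mean, here is a minimal sketch in plain Python. It assumes the text has already been extracted from the PDFs into pages and lines (for example with something like pdftotext), and the names build_index and search are just placeholders I made up. It is not meant as a real solution, only to show the kind of result I'm after.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")

def build_index(docs):
    """Map each lowercased token to the (doc, page, line) positions where it occurs.

    docs: {doc_name: [page, ...]}, where each page is a list of text lines.
    """
    index = defaultdict(set)
    for name, pages in docs.items():
        for page_no, lines in enumerate(pages, start=1):
            for line_no, line in enumerate(lines):
                for tok in TOKEN.findall(line.lower()):
                    index[tok].add((name, page_no, line_no))
    return index

def search(index, docs, word, context=1):
    """Return (doc, page, lines-with-context) for every line containing word."""
    hits = []
    for name, page_no, line_no in sorted(index.get(word.lower(), ())):
        lines = docs[name][page_no - 1]
        lo = max(0, line_no - context)
        hi = min(len(lines), line_no + context + 1)
        hits.append((name, page_no, lines[lo:hi]))
    return hits

# Made-up sample data standing in for extracted PDF text.
docs = {
    "a.pdf": [["The cat sat.", "A dog barked.", "Then silence."]],
    "b.pdf": [["Nothing here."], ["Dogs run.", "My dog sleeps."]],
}
index = build_index(docs)
for hit in search(index, docs, "dog"):
    print(hit)
```

The lookup touches only the positions stored for the word instead of rescanning every file, which is the speed advantage I'm hoping for. Note that this sketch matches whole tokens only, so "Dogs" is not a hit for "dog"; a real full-text index would handle that with stemming or similar analysis.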
Thanks,
John