Grep-like results on an Elasticsearch index

John10 · February 26, 2023, 4:20pm

If you have 5,000 PDF documents and you want to return every instance of the word "dog" across all of those documents (including the context where it occurs: page number, the line before and after the match, etc.), you could use a grep-like utility (pdfgrep, etc.); however, this is relatively slow because it scans every file on each search instead of using an index.

Elasticsearch works at the document level and returns documents, as opposed to individual matches within and across every document. The use case is just different. It looks like it might be possible to have FSCrawler index something like paragraphs or sentences as nested objects of the document and then return the hits within those nested objects across the entire index.
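To make the nested-object idea concrete, here is a rough sketch of what the mapping and search body might look like. It only builds the JSON bodies as Python dicts and does not talk to a cluster. The field names (`filename`, `paragraphs`, `page`, `text`) are made up for illustration, and it assumes the paragraph splitting is done by whatever indexes the documents.

```python
import json

# Hypothetical layout: one Elasticsearch document per PDF, with each
# paragraph stored as a nested object. All field names are assumptions.
MAPPING = {
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "paragraphs": {
                "type": "nested",
                "properties": {
                    "page": {"type": "integer"},
                    "text": {"type": "text"},
                },
            },
        }
    }
}


def nested_match_query(term, max_inner_hits=100):
    """Build a search body that asks for the matching paragraphs via inner_hits."""
    return {
        "_source": ["filename"],
        "query": {
            "nested": {
                "path": "paragraphs",
                "query": {"match": {"paragraphs.text": term}},
                # inner_hits returns the nested objects that matched,
                # not only the parent document.
                "inner_hits": {"size": max_inner_hits},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(nested_match_query("dog"), indent=2))
```

Note that `inner_hits` returns only 3 nested hits per document by default, so a grep-like listing would need `size` raised as above, within whatever limit the index allows.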

Is that really feasible?
Would that actually be faster than just using grep? (I don't see how it couldn't be when dealing with thousands of documents but I wonder.)
Is there some other obvious solution I'm missing?

Basically, what I'm after is using grep with a full-text index to get grep-like results with the speed advantages of a full-text index.
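As a toy illustration of what I mean (not how Elasticsearch is implemented), an inverted index over lines could answer a grep-style query without rescanning the files. The function names and the `docs` layout below are invented for the example, and it matches whole words only, unlike grep's substring matching.

```python
import re
from collections import defaultdict


def build_index(docs):
    """docs maps a document name to its list of lines.

    Returns a mapping: token -> list of (doc_name, zero-based line index).
    """
    index = defaultdict(list)
    for name, lines in docs.items():
        for line_no, line in enumerate(lines):
            # set() so a line is recorded once even if the word repeats in it
            for token in set(re.findall(r"\w+", line.lower())):
                index[token].append((name, line_no))
    return index


def grep_indexed(index, docs, term, context=1):
    """Return (doc_name, 1-based line number, context lines) for each matching line."""
    results = []
    for name, line_no in index.get(term.lower(), []):
        lines = docs[name]
        start = max(0, line_no - context)
        end = min(len(lines), line_no + context + 1)
        results.append((name, line_no + 1, lines[start:end]))
    return results
```

The lookup touches only the postings for the term, which is where the speed advantage over scanning every file would come from.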

Thanks,
John

dadoonet · February 26, 2023, 7:42pm

FSCrawler cannot (yet) index paragraphs or even pages.
So it will only tell you which documents contain the term.

That said, you can use the highlighter to get more information about the context of each match.
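A minimal sketch of a highlight request, written as a plain Python dict so it can be passed to any client. It assumes the extracted text lives in a field named `content`; adjust this to your mapping. The helper names and the default sizes are only illustrative.

```python
def highlight_query(term, field="content", fragment_size=150, fragments=50):
    """Build a search body that returns highlighted fragments around each match."""
    return {
        "_source": False,  # we only want the fragments, not the full text
        "query": {"match": {field: term}},
        "highlight": {
            "fields": {
                field: {
                    "fragment_size": fragment_size,      # characters per fragment
                    "number_of_fragments": fragments,    # max fragments per document
                }
            }
        },
    }


def grep_like_lines(response):
    """Flatten a search response into (document id, fragment) pairs."""
    for hit in response["hits"]["hits"]:
        # "highlight" maps each field name to its list of fragments
        for field_fragments in hit.get("highlight", {}).values():
            for fragment in field_fragments:
                yield hit["_id"], fragment
```

Each fragment has the matched term wrapped in `<em>` tags by default. This gives the surrounding text but not page or line numbers, since that information is not in the index.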

