Search within PDF files


#1

Hi, we're using Elastic's API to search a document repository we created for our users (about 6,000 documents). Most of the documents are PDF files, and we're struggling to offer our customers real value: we want to help them understand what each document contains without opening it, by showing quotes from the document text that include the searched words or expression in context.

Google's search results show one, two, or even three sentences beneath each result that include the specific words searched (see below).
I'd like to understand what our options are for doing something similar, in order to improve our search results and better engage our users. Thanks


(David Pilato) #2

Highlighting is the way to go I believe. See https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-highlighting.html


#3

Thanks, we're already using highlighting, and it helps us show highlighted headlines. But it still doesn't help us present quotes from within the document itself, so that users can see one or two sentences that include the search terms in the document's context (as shown in the screenshot I added from Google).
Can we have something like that?


(David Pilato) #4

I don't see why it would not work for the document content as well. At least I was using something similar but it was some years ago.


#5

Thanks. The issue is the step before highlighting: how to get the text itself onto the results page in the first place, rather than only after opening a specific document. Am I missing something? How can we do it? Thanks


(David Pilato) #6

In your Google example you were searching for coins, right?
Do you want to show some content even with a match_all query?

Anyway, you have the content indexed, so maybe you can just extract the first x characters?
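
A minimal sketch of that fallback, assuming the extracted PDF text is already available as a plain string. `preview` and `max_chars` are hypothetical names for illustration, not part of any Elasticsearch API. The highlighter also has a `no_match_size` option that returns the start of a field when nothing matches, which may cover the same case on the server side.

```python
def preview(text: str, max_chars: int = 200) -> str:
    """Return roughly the first max_chars characters, cut at a word boundary."""
    # Collapse runs of whitespace, which extracted PDF text often contains.
    flat = " ".join(text.split())
    if len(flat) <= max_chars:
        return flat
    cut = flat[:max_chars]
    # Back up to the last space so a word is not split in half.
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut) + "..."
```

This only gives the opening of a document, so it is a fallback for results with no highlighted fragment rather than a replacement for highlighting.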


#7

Hi David, yes, I was searching for coins, but the search could also have been for a whole sentence.
I want to present any content within the PDF that includes the word, sentence, or part of a sentence that was searched.
I'm not sure what you're suggesting. All we have now is documents indexed with general filters. What I'm looking for is a way to better engage our users. Any suggestions? Thanks!


(David Pilato) #8

So highlighting is the way to go.

If it does not work for you please provide a full example we can play with.


(Minh Hoang, Nguyen) #9

@liavch As @dadoonet suggested, you should go with highlighting in Elasticsearch. Try this example query and see the result:

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : { "your-field" : "search keyword" }
    },
    "highlight" : {
        "pre_tags" : ["<b>"],
        "post_tags" : ["</b>"],
        "fields" : {
            "your-field" : {
                "fragment_size" : 30,
                "number_of_fragments" : 10
            }
        },
        "order" : "score"
    }
}
'
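
The response to a query like this carries the matching fragments in a `highlight` object on each hit, keyed by field name. Below is a minimal sketch of turning those into result-page snippets, assuming the response has already been parsed from JSON into a dict. `snippets_from_response`, the `content` field name, and the sample data are made up for illustration.

```python
from typing import Dict, List


def snippets_from_response(
    response: dict, field: str, max_fragments: int = 3
) -> Dict[str, List[str]]:
    """Map each hit's _id to its highlighted fragments for one field."""
    snippets = {}
    for hit in response.get("hits", {}).get("hits", []):
        # Hits without a match in this field have no entry under "highlight".
        fragments = hit.get("highlight", {}).get(field, [])
        snippets[hit["_id"]] = fragments[:max_fragments]
    return snippets


# Hand-written sample shaped like a search response; not real output.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1",
                "highlight": {
                    "content": ["rare <b>coins</b> from", "silver <b>coins</b> were"]
                },
            },
            {"_id": "doc-2"},
        ]
    }
}

result = snippets_from_response(sample, "content", max_fragments=1)
```

Each list can then be rendered beneath its result, much like the sentences in the Google example. A hit with an empty list has no fragment to show and would need some other preview.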
