Avoiding the irrelevant documents in Elasticsearch

rahulnama · December 27, 2018, 3:29pm

Hi Team

I've indexed a few PDF files related to computer problems. Each document represents a book.

Whenever a user runs a search, we return the URL of the matching book as the response.

query:

GET testbooks/_search
{
    "_source": "url",
    "query": {
        "match": {
            "content": {
                "query": "light is not working",
                "operator": "and"
            }
        }
    }
}

Expected output:

"hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }

Actual Output:

"hits": {
    "total": 1,
    "max_score": 1.9589642,
    "hits": [
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "9",
        "_score": 1.9589642,
        "_source": {
          "url": "/MSOffice_HowTo/9780735699236.pdf"
        }
      }
    ]
  }

Yes, it matched because the document contains the terms light, working, etc. But we can't show that document to the user because it isn't relevant to the query; the document is all about MS Office issues.

How can we handle such scenarios? We must display a document only if it has information relevant to the user's query.

Any suggestions are appreciated.

-Rahul

nik9000 · December 27, 2018, 4:14pm

If you know what kinds of books the user cares about, you could put a keyword on each book and then use a bool query where both your match query and a new match query on the keyword field are in the must part of the query.

If you don't know up front how to tag your documents or what tags the user cares about, then you are going to have to get more creative, and Elasticsearch doesn't have anything out of the box for that.
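
A minimal sketch of the bool query described above, reusing the index and match query from the first post. The keyword field name category and the tag value hardware are hypothetical; they stand in for whatever tag is actually stored on each book and known for the user.

```console
# "category" and "hardware" are hypothetical: substitute the real
# keyword field and the tag the user is known to care about.
GET testbooks/_search
{
    "_source": "url",
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "content": {
                            "query": "light is not working",
                            "operator": "and"
                        }
                    }
                },
                {
                    "match": {
                        "category": "hardware"
                    }
                }
            ]
        }
    }
}
```

Because both clauses sit in must, a book is returned only when its content matches the search text and its tag matches; the MS Office book would be excluded as long as it carries a different tag.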

rahulnama · December 27, 2018, 5:30pm

Makes sense @nik9000

But suppose, for instance, we tagged a document as windows: a file that covers a lot of subtopics, such as how to reset a password, how to connect to the internet, etc.

If the user searches for "how to connect to internet", the query will not return any results, because the must clause returns 0 hits when the search doesn't match windows.

Document tagging is one possible alternative, but I doubt it is the right fit for this case. I will work on it.

And yes, as you said, we also don't know what the user cares about.

As of now we are removing stop words and using stemmers and tokenizers (NLP-style analysis) while indexing.

Are there any other alternatives, such as using the rescore API, tweaking BM25 parameters, putting a score limit on the documents, or writing some more advanced queries?
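
Of the options listed above, the score limit corresponds to the min_score parameter of the search request, which drops hits whose score falls below a threshold. The following is only a sketch: the value 3.0 is an arbitrary placeholder, picked here simply because it is higher than the 1.9589642 score of the unwanted hit in the first post. Scores are not normalized and change with the query text and the index statistics, so a fixed cutoff would have to be tuned against real queries.

```console
# The 3.0 threshold is an assumed placeholder, not a recommended value.
GET testbooks/_search
{
    "_source": "url",
    "min_score": 3.0,
    "query": {
        "match": {
            "content": {
                "query": "light is not working",
                "operator": "and"
            }
        }
    }
}
```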

rahulnama · December 28, 2018, 5:06am

Hi Tim @Tim_Allison

Any suggestions here?

system · January 25, 2019, 5:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
