Avoiding irrelevant documents in Elasticsearch

Hi Team

I've indexed a few PDF files related to computer problems. Each document represents a book.

Whenever a user submits a search query, we return the URL of the matching book as the response.

query:

GET testbooks/_search
{
    "_source": "url",
    "query": {
        "match": {
            "content": {
                "query": "light is not working",
                "operator": "and"
            }
        }
    }
}

Expected output:

"hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }

Actual Output:

"hits": {
    "total": 1,
    "max_score": 1.9589642,
    "hits": [
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "9",
        "_score": 1.9589642,
        "_source": {
          "url": "/MSOffice_HowTo/9780735699236.pdf"
        }
      }
    ]
  }

Yes, this happens because the document contains the terms "light", "working", etc. somewhere in its content. But we can't show that document to the user, because it isn't relevant to the user's query: the document is all about MS Office issues.

How should we handle such scenarios? We must display a document only if it contains information relevant to the user's query.

Any suggestions are appreciated.

-Rahul

If you know what kinds of books the user cares about, you could stick a keyword on each book and then add a bool query where both your match query and a new match query for the keyword field are in the must part of the query.

If you don't know up front how to tag your documents or what tags the user cares about, then you are going to have to get more creative, and Elasticsearch doesn't have anything out of the box for you.
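As a rough sketch of the bool query described above, the request body could be built like this. The field name `category` and the tag value `hardware` are hypothetical; they assume each book was indexed with such a keyword, which the original mapping may not have.

```python
import json


def tagged_search_body(user_query, tag, tag_field="category"):
    """Build a search body that requires both a content match and a tag match.

    `tag_field` is an assumed keyword field holding the book's topic tag.
    """
    return {
        "_source": "url",
        "query": {
            "bool": {
                # Both clauses sit in `must`, so a book is returned only
                # when its content matches AND it carries the requested tag.
                "must": [
                    {
                        "match": {
                            "content": {
                                "query": user_query,
                                "operator": "and",
                            }
                        }
                    },
                    {"match": {tag_field: tag}},
                ]
            }
        },
    }


body = tagged_search_body("light is not working", "hardware")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET testbooks/_search. Note that the tag is supplied separately from the user's text, so this only helps when the application knows which tag to ask for.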

Makes sense @nik9000

But suppose, for instance, we tag a document as "windows": a file that covers a lot of subtopics, such as how to reset a password, how to connect to the internet, etc.

If the user searches for "how to connect to internet", the query will not return any results, because the must clause on the tag will match nothing: the query text doesn't match "windows".

Document tagging is one possible alternative, but I doubt it is the right fit for this case. I will work on it.

And yes, as you said, we also don't know what the user cares about.

As of now we remove stop words and apply stemmers and tokenizers (basic NLP) while indexing.

Are there any other alternatives? For example, using the rescore API, tweaking BM25 parameters, putting a minimum score limit on documents, or writing more advanced queries?
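To make two of those options concrete, here is a minimal sketch of a stricter request body: a `match_phrase` query with `slop`, which requires the terms to appear close together rather than anywhere in the book, plus an optional top-level `min_score` cutoff. The slop of 2 and the threshold of 5.0 are arbitrary illustrative values, not recommendations; BM25 scores depend on the query and on index statistics, so a fixed cutoff would need tuning against real data.

```python
import json


def strict_search_body(user_query, slop=2, min_score=None):
    """Build a search body that only matches terms appearing near each other.

    `slop` and `min_score` are illustrative assumptions and need tuning.
    """
    body = {
        "_source": "url",
        "query": {
            # match_phrase requires the analyzed terms to occur in order,
            # allowing up to `slop` position moves between them.
            "match_phrase": {
                "content": {
                    "query": user_query,
                    "slop": slop,
                }
            }
        },
    }
    if min_score is not None:
        # Hits scoring below this value are dropped from the response.
        body["min_score"] = min_score
    return body


strict_body = strict_search_body("light is not working", slop=2, min_score=5.0)
print(json.dumps(strict_body, indent=2))
```

With a query like this, a book that merely mentions "light" in one chapter and "working" in another would no longer match, though how well that fits depends on how the stop-word filter and stemmer were configured at index time.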

Hi Tim @Tim_Allison

Any suggestions here?
