Thanks for taking the time to read my question. We are quite new to App Search; our Dutch insurance company just went live with version 7.14.1. We are reviewing search terms and desired results, and in that process we have come across some things we do not understand. Some documents in which the search term occurs in more of the searched fields, and more often within the same field, do not get a higher score than other documents. I understand that the indexing algorithms might be less straightforward or intuitive, but we still want to understand as much as possible and improve our search results, without having to rely heavily on Curations, Tuning, and/or Boosting.
Below is an example.
In our setup we use the API to search for CMS documents in App Search, but the results are the same in the App Search UI with the Query Tester. We use no altered weighting (all weights are 1), no boosting, and we search only 5 fields: url, title, body, meta keywords, meta description.
Example
When we search for 'zorgverzekering' (Dutch for health insurance), we get, among others, the following results:
Result 2: page Gemeentepakket
0x in url
0x in title
6x in body
1x in meta description
1x in meta keywords
Result 6: Zorg campaign page (campagnepagina)
0x in url
1x in title
14x in body
1x in meta description
1x in meta keywords
Result 34: page unive.nl/zorgverzekering <- in our view the most relevant page
1x in url
1x in title
36x in body
1x in meta description
1x in meta keywords
If anyone can share insight into what other factors contribute to the scoring of search results, that would be much appreciated.
Things like how often a term shows up in a single document, how often a term shows up in ALL documents, and the length of the individual fields a term appears in are all factors in relevance scoring.
"Relevance" is a relative concept; whether a document is "relevant" or not for a particular search is entirely context dependent, and differs from use case to use case. This is why we provide our relevance tools, like weights and boosts. You'll need those tools to ensure that the "right" documents are considered relevant for your use case.
Hi Jason, a very interesting read that explains a lot. Our example most likely suffers from the fact that the top result, while containing the search term less often, is a much shorter document. This probably leads the BM25 algorithm to score it higher.
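That hypothesis can be sketched numerically with the term-frequency/length part of the BM25 formula. This is a simplified illustration, not App Search's actual implementation: the k1, b, and average-length values below are assumptions, the IDF factor is left out (it is identical for both documents for a single query term), and the token counts are made up to resemble a short page versus a long one.

```python
# Minimal sketch of BM25's TF/length-normalization component,
# assuming common defaults k1=1.2 and b=0.75 and an assumed
# average field length of 1000 tokens.
K1 = 1.2      # term-frequency saturation parameter
B = 0.75      # length-normalization strength
AVGDL = 1000  # assumed average body length in tokens

def bm25_tf(tf: int, doc_len: int, k1: float = K1, b: float = B,
            avgdl: float = AVGDL) -> float:
    """TF component of BM25 for one term in one field (IDF omitted)."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# A short page with 6 occurrences vs a long page with 36 occurrences:
short_page = bm25_tf(tf=6, doc_len=200)
long_page = bm25_tf(tf=36, doc_len=3000)
print(f"short page: {short_page:.3f}, long page: {long_page:.3f}")
# Term frequency saturates quickly while length normalization keeps
# growing, so the shorter document can score higher despite far
# fewer occurrences of the term.
```

With these assumed numbers the short page edges out the long one, which matches the surprising ordering in the example above.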
We will try to tune our search fields and weights to see if we get different results. Hopefully we will not have to resort to tweaking b and k1.
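For anyone trying the same thing: field weights can be passed per query via the App Search Search API's `search_fields` option, without re-tuning the engine itself. A hedged sketch, where the engine name, host, and the specific weight values are all hypothetical examples, not recommendations:

```python
import json

# Hypothetical payload for POST /api/as/v1/engines/<engine>/search.
# Boosting title/url over body compensates for body-length effects;
# the exact weights here are illustrative only.
payload = {
    "query": "zorgverzekering",
    "search_fields": {
        "title": {"weight": 3},
        "url": {"weight": 2},
        "meta_keywords": {"weight": 2},
        "meta_description": {"weight": 1},
        "body": {"weight": 1},
    },
}

# Sent with a search key, e.g. (requests call commented out, since
# host and key are deployment-specific):
# requests.post(f"https://{host}/api/as/v1/engines/cms-pages/search",
#               headers={"Authorization": f"Bearer {search_key}"},
#               json=payload)
print(json.dumps(payload, indent=2))
```

Iterating on weights this way in the Query Tester (or via the API) is usually enough; changing b and k1 is a much blunter instrument.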