How to calculate the score of one string against another string with a specific analyser


(Red) #1

We can use the analyze API, for example curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&text=this+is+a+test',
to see what the text will be parsed into.

How can we test or get the score of a query string against another text string,
just like the explain API does?

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&text=this+is+a+test' -d '{
"query": {
"match": { "some_field": "test" }
}
}'
If something like this worked, I could get the relevance score of the query string "test" against the text string "this+is+a+test".

How can I do that?


(Doug Turnbull) #2

Relevance scoring is based on TF*IDF-based formulas. That means that, at the foundation, three values matter the most in scoring:

  • Term Frequency (TF): how many times does "test" occur here?
  • Inverse Document Frequency (IDF): how rare is "test"? Rare terms (low document frequency, or high IDF) receive a higher score than common ones.
  • Field norms: how short is the text? "test" occurring once in a short snippet is much more important to that snippet than "test" occurring once in a lengthy book.

These numbers are multiplied together to measure how weighty "test" is in the text being scored.

That's just the tip of the iceberg. First, there's the fact that TF, IDF, and field norms by themselves aren't directly proportional to relevance, so the various "similarities", as they're called, scale them differently. Instead of taking these numbers directly, TF, IDF, and fieldNorms are computed as

TF score = sqrt(tf)
IDF score = 1 + log( numDocs / (1 + docFreq) )
fieldnorms = 1/sqrt(length)

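where
tf -- number of times the term occurs in the field being scored
numDocs -- total number of docs in the collection
docFreq -- number of docs that contain the term
length -- length of the field in terms of positions (depending on whether you discount overlaps).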
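
As a rough illustration of how these three factors combine, here is a minimal Python sketch. It assumes Lucene's classic TF*IDF similarity with idf = 1 + ln(numDocs / (docFreq + 1)), and it ignores the lossy one-byte encoding that Lucene applies to stored norms. The function names and the collection statistics (1000 docs, 9 of them containing "test") are made up for the example.

```python
import math

def tf_score(term_freq):
    # Square root dampens the effect of repeated occurrences.
    return math.sqrt(term_freq)

def idf_score(num_docs, doc_freq):
    # Classic similarity (assumption): rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(length):
    # Shorter fields get a larger norm.
    return 1.0 / math.sqrt(length)

def field_weight(term_freq, num_docs, doc_freq, length):
    # The three factors are multiplied together.
    return tf_score(term_freq) * idf_score(num_docs, doc_freq) * field_norm(length)

# "test" occurs once in the 4-term text "this is a test".
# Hypothetical collection: 1000 docs, 9 of which contain "test".
weight = field_weight(term_freq=1, num_docs=1000, doc_freq=9, length=4)
print(round(weight, 4))  # 2.8026
```

Changing `length` from 4 to 400 drops the result by a factor of 10, which is the "short snippet versus lengthy book" effect described above.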

Now there are so many gotchas and caveats here that I really should just point you at several places to read more about this.

First, probably the most detailed place to read about this topic as it pertains to Lucene is my relevance book. We dedicate quite a bit of space to the topic.

Second, the Lucene & ES community have several well-written articles on this topic.

Finally, you should know that what's known as "TF*IDF" is being replaced as the default scoring computation by something new called BM25 in the next major Lucene version. BM25 is still based on the same statistics, but the computation has been shown experimentally to be far more robust. It's also more complex. I would recommend learning about BM25 at the following places

Hope that's useful! It's really just the beginning of an explanation of a bit of an intricate topic :)


(Doug Turnbull) #3

Also, to view the relevance scoring explanation, index the document, then search with explain=true:

curl -XGET 'localhost:9200/yourindex/_search?explain=true' -d '{
"query": {
"match": { "some_field": "test" }
},
"explain": true
}'
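
With explain enabled, each hit carries an `_explanation` object: a tree of nodes with `value`, `description`, and nested `details`. The sketch below shows one way to flatten such a tree into indented lines for reading. The `sample` dict is a hand-written fragment shaped like that output, not a real response, and `format_explanation` is a hypothetical helper name.

```python
def format_explanation(node, depth=0):
    """Return the explanation tree as a list of indented text lines."""
    line = "{}{:.4f}  {}".format("  " * depth, node["value"], node["description"])
    lines = [line]
    # Leaf nodes may have no "details" key, or an empty list.
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hand-written fragment (assumption), shaped like hits.hits[0]._explanation.
sample = {
    "value": 0.5,
    "description": "fieldWeight in 0, product of:",
    "details": [
        {"value": 1.0, "description": "tf(freq=1.0), with freq of:"},
        {"value": 1.0, "description": "idf(docFreq=1, maxDocs=2)"},
        {"value": 0.5, "description": "fieldNorm(doc=0)"},
    ],
}

lines = format_explanation(sample)
print("\n".join(lines))
```

In practice you would pass the `_explanation` value of a hit from the parsed JSON response instead of `sample`.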

(Red) #4

I know that, but what I want is this: to calculate the score of a query string against a text string at query time. It is more like testing the mapping and the analyser of the index type.

Thank you, I now clearly understand the details of scoring.

total_score = term_score(term1) + term_score(term2) + ...

term_score(termN) = queryWeightScore(termN) * fieldWeightScore(termN)

fieldWeightScore(termN) = field_tf(termN) * field_idf(termN) * field_fieldNorm(termN)

termN is one of the terms that the analyser (set at query time or in the field mapping) splits the query string into.

I've found that queryWeight is processed the same way as fieldWeight. I didn't find an explanation of queryWeight on the website, only an explanation of fieldWeight,
so I just treat them the same (processing of queryWeight = processing of fieldWeight).

That is all I've found, and the fieldNorm explanation is what I was looking for.
In my scenario, I set a product name field in the mapping, and I need the product name of every product to be treated at the same level.
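
For scoring a query string against a single text string without a cluster, the per-term breakdown in this thread can be approximated offline. The following is only a sketch under several assumptions: it uses Lucene's classic TF*IDF practical scoring function (per-term factors multiplied, per-term scores summed, then scaled by coord), takes queryWeight as idf times queryNorm, treats the text as the only document in the index, ignores the lossy norm encoding, and uses a stand-in `analyze` function (lowercase plus whitespace split) instead of a real Elasticsearch analyser. All function names are hypothetical.

```python
import math
from collections import Counter

def analyze(text):
    """Stand-in analyser (assumption): lowercase, then split on whitespace."""
    return text.lower().split()

def classic_idf(num_docs, doc_freq):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query, text):
    """Score `query` against `text` as if `text` were the only indexed document."""
    doc_counts = Counter(analyze(text))
    query_terms = list(dict.fromkeys(analyze(query)))  # unique terms, order kept
    matched = [t for t in query_terms if t in doc_counts]
    if not matched:
        return 0.0
    # One-document index: docFreq is 1 for terms in the text, otherwise 0.
    idf = {t: classic_idf(1, 1 if t in doc_counts else 0) for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idf.values()))
    field_norm = 1.0 / math.sqrt(sum(doc_counts.values()))
    total = 0.0
    for t in matched:
        query_weight = idf[t] * query_norm
        field_weight = math.sqrt(doc_counts[t]) * idf[t] * field_norm
        total += query_weight * field_weight
    # coord rewards documents that match more of the query terms.
    coord = len(matched) / len(query_terms)
    return coord * total

single = score("test", "this is a test")
print(round(single, 4))  # 0.1534
```

Because fieldNorm depends on the number of terms, a longer product name scores lower for the same match; swapping `field_norm` for a constant 1.0 in the sketch shows what scoring looks like when every name is treated at the same level.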

