How to calculate the score for a string to another string with the specific analyser

Red · December 31, 2015, 11:48am

We can use the analyze API, e.g. curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&text=this+is+a+test'
to see what tokens the text will be parsed into.

How can we test or get the score of a query string against another text string,
the way the explain API does? Something like this:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&text=this+is+a+test' -d '{
"query": {
"match": { "some_field": "test" }
}
}'
If something like this worked, I could get the relevance score of the query string "test" against the text string "this+is+a+test".

How can I do that?

softwaredoug · December 31, 2015, 8:01pm

Relevance scoring is based on TF*IDF formulas. That means that, at the foundation, three values matter most in scoring:

Term Frequency (TF): how many times does "test" occur here?
Inverse Document Frequency (IDF): how rare is "test"? Rare terms (low document frequency, or high IDF) receive a higher score than common ones.
Field norms: how short is the text? "test" occurring once in a short snippet is much more important to that snippet than "test" occurring once in a lengthy book.

These numbers are multiplied together to measure how weighty "test" is in the text being scored.

That's just the tip of the iceberg. First, TF, IDF, and field norms by themselves aren't directly proportional to relevance, so the various "similarities," as they're called, scale them differently. Instead of taking these numbers directly, the TF, IDF, and fieldNorm scores are computed as

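TF score = sqrt(tf)
IDF score = 1 + log( numDocs / (1 + docFreq) )
fieldNorm = 1 / sqrt(length)

where
tf -- number of times the term occurs in the text being scored
numDocs -- total number of docs in the collection
docFreq -- number of docs that contain the term
length -- length of the document in terms of positions (depending on whether you discount overlaps).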
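
As a rough illustration, the three scaled values can be computed and multiplied in a few lines of Python. This is only a sketch: the function names are made up for the example, the IDF form is the one given in the Lucene TFIDFSimilarity docs (1 + ln(numDocs / (docFreq + 1))), and the statistics in the usage lines are invented. Lucene also stores field norms with reduced precision, so real explain output can differ slightly.

```python
import math


def tf_score(freq):
    """Scaled term frequency: square root of the raw count."""
    return math.sqrt(freq)


def idf_score(num_docs, doc_freq):
    """Scaled inverse document frequency (natural log)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(length):
    """Shorter fields get a larger norm."""
    return 1.0 / math.sqrt(length)


def field_weight(freq, num_docs, doc_freq, length):
    """Product of the three scaled values for one term in one field."""
    return tf_score(freq) * idf_score(num_docs, doc_freq) * field_norm(length)


if __name__ == "__main__":
    # Invented statistics: "test" occurs once, in 9 of 1000 docs.
    short_snippet = field_weight(freq=1, num_docs=1000, doc_freq=9, length=4)
    lengthy_book = field_weight(freq=1, num_docs=1000, doc_freq=9, length=400)
    print(short_snippet, lengthy_book)
```

With everything else equal, the 4-position snippet scores higher than the 400-position text, which is the field norm effect described above.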

Now, there are so many gotchas and caveats here that I really should just point you at several places to read more about this.

First, probably the most detailed place to read about this topic as it pertains to Lucene is my relevance book. We dedicate quite a bit of space to the topic.

Second, the Lucene & ES community have several well-written articles on this topic:

The Java docs on TFIDFSimilarity
The ES: The Definitive Guide discussion of relevance scoring

Finally, you should know that what's known as "TF*IDF" is being replaced as the default scoring computation by something new called BM25 in the next major Lucene version. BM25 is still based on the same statistics, but the computation has been shown experimentally to be far more robust. It's also more complex. I would recommend learning about BM25 at the following places:

My Blog article
This Elastic blog article

Hope that's useful! It's really just the beginning of an explanation of a fairly intricate topic.

softwaredoug · December 31, 2015, 8:03pm

Also, to view the relevance scoring explanation, index the document, then set explain=true:

curl -XGET 'localhost:9200/yourindex/_search?explain=true' -d '{
"query": {
"match": { "some_field": "test" }
},
"explain": true
}'

Red · January 2, 2016, 8:07am

I know that. But what I want is to calculate the score of a query string against a text string at query time. It is more about testing the mapping and the analyser of the index type.

Thank you, I now clearly understand the details of scoring.

total_score = term_score(term1) + term_score(term2) + ...

term_score(termN) = queryWeightScore(termN) * fieldWeightScore(termN)

fieldWeightScore(termN) = field_tf(termN) * field_idf(termN) * field_fieldNorm(termN)

termN is one term of the query string, as split by the analyser at query time or by the analyser set on the field in the mapping.

I've found that queryWeight is processed the same way as fieldWeight. I didn't find an explanation of queryWeight on the website, only of fieldWeight,
so I just treat them the same (processing of queryWeight = processing of fieldWeight).

The above is all I've found, and the fieldNorm explanation is what I was looking for.
In my use case, I have a product name field in the mapping, and I need the product names of all products to be treated at the same level.
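
A rough offline sketch of this idea, in Python, is to analyse both strings and sum a per-term weight over the query terms. Everything here is an assumption made for illustration: analyze() is a hypothetical lowercase-and-whitespace stand-in for a real analyser, the text string is treated as the only document in the collection unless other statistics are passed in, the per-term values are multiplied together as described earlier in the thread, and queryWeight, queryNorm and coord are left out. The use_norms flag is also made up; turning it off fixes fieldNorm at 1 so that a short and a long product name are treated at the same level.

```python
import math
from collections import Counter


def analyze(text):
    """Hypothetical stand-in for an analyser: lowercase, split on whitespace."""
    return text.lower().split()


def score(query, text, num_docs=1, doc_freqs=None, use_norms=True):
    """Sum tf * idf * fieldNorm over the analysed query terms found in text."""
    doc_freqs = doc_freqs or {}
    tokens = analyze(text)
    counts = Counter(tokens)
    # fieldNorm = 1 / sqrt(length); fixed at 1.0 when norms are ignored.
    norm = 1.0 / math.sqrt(len(tokens)) if (use_norms and tokens) else 1.0
    total = 0.0
    for term in analyze(query):
        freq = counts[term]
        if freq == 0:
            continue  # term does not occur in the text
        # Assumption: a matching term occurs in 1 doc unless told otherwise.
        doc_freq = doc_freqs.get(term, 1)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        total += math.sqrt(freq) * idf * norm
    return total


if __name__ == "__main__":
    print(score("test", "this is a test"))
    print(score("test", "this is a test", use_norms=False))
```

This only approximates what the explain API reports for a real index, since the real numbers depend on index-wide statistics and on the analyser actually configured in the mapping.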
