Raw tf-idf

pierce · July 6, 2017, 8:54pm

I'm looking for a really scalable method to do scoring by tf-idf on a large dataset, so Elasticsearch naturally came to mind. Crucially, I need raw tf-idf rather than the Practical Scoring Function that Lucene uses under the hood. Is there any way to get Elasticsearch to return a raw tf-idf score without the additional fluff? I've tested each of the built-in similarity implementations, and none works as well as plain tf-idf for my use case.

So far I've investigated custom scoring functions, but they don't seem to be the right tool for the task.

Also, I'm using a hosted Elasticsearch instance, so I don't have access to the internals of the Java code, only to the REST API.
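
For concreteness, here is a minimal Python sketch of what "raw tf-idf" is taken to mean in this question: the term count in a document multiplied by ln(N / df), with no length normalization or boosts. The function name `raw_tf_idf`, the whitespace tokenization, and the unsmoothed logarithm are illustrative assumptions. Other tf-idf variants exist, and this is not how Elasticsearch computes scores.

```python
import math
from collections import Counter


def raw_tf_idf(docs):
    """Return one {term: tf * idf} dict per document.

    Assumptions (illustrative only): lowercase whitespace tokenization,
    tf = raw term count, idf = ln(N / df) with no smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return scores


# Toy corpus, made up for illustration.
docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]
weights = raw_tf_idf(docs)
```

A term that occurs in every document (here "the") gets a weight of zero, while a term unique to one document gets the largest idf.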

warkolm · July 6, 2017, 9:28pm

Elasticsearch uses BM25 as of 5.0.

It is likely to be the right tool, though. What put you off it?
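
To make the contrast with raw tf-idf concrete, the following Python sketch computes the BM25 contribution of one term to one document. The formula and the defaults k1 = 1.2 and b = 0.75 follow the commonly documented BM25 similarity as used in the Lucene versions of that era; later Lucene versions drop the constant (k1 + 1) factor, which does not affect ranking. The function name and argument names are hypothetical, and Lucene's lossy encoding of document length is ignored here.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 score of a single term in a single document (illustrative sketch)."""
    # Smoothed inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 disables it, b = 1 applies it fully.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates instead of growing linearly as in raw tf-idf.
    return idf * tf * (k1 + 1) / (tf + norm)


# Made-up statistics: the same term seen once versus ten times.
once = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
ten_times = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
```

Unlike raw tf-idf, ten occurrences score well under ten times a single occurrence, and longer documents are penalized unless `b` is set to 0.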

pierce · July 6, 2017, 10:06pm

All the examples seemed to be tied to individual fields, and it looked like there was no way to extract the overall document frequency of each term. I would love to find a way to do this, though.

rjernst · July 6, 2017, 10:32pm

See the new advanced scripting docs, which describe how to create a script engine. A script engine has access to Lucene internals, so you could get raw tf and idf.

https://www.elastic.co/guide/en/elasticsearch/reference/5.x/modules-scripting-engine.html

pierce · July 6, 2017, 10:35pm

Hi Ryan - is there a way to do similar scripting via REST only? I have to use a remote Elasticsearch client, so I don't have the ability to add a plugin.

rjernst · July 6, 2017, 10:50pm

There is no other way to gain access to Lucene internals.

system · August 3, 2017, 10:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
