Raw tf-idf

I'm looking for a really scalable method to do scoring by tf-idf on a large dataset, so Elasticsearch naturally came to mind. Crucially, I need raw tf-idf rather than the Practical Scoring Function that Lucene uses under the hood. Is there any way to get Elasticsearch to return a raw tf-idf score without the additional fluff? I've tested each of the built-in similarity implementations, and none of them works as well for my data as plain tf-idf.

So far I've investigated custom scoring functions, but they don't seem to be the right tool for the task.

Also, I'm using a hosted Elasticsearch instance, so I don't have access to the internals of the Java code, only the REST API.

Elasticsearch uses BM25 as of 5.0.
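To make the difference concrete, here is a minimal Python sketch of the two per-term scores. It assumes "raw tf-idf" means the textbook `tf * ln(N / df)`, which the original post does not spell out, and it uses the standard BM25 formula with the common defaults `k1=1.2` and `b=0.75`. The function names are made up for illustration, and the exact constants Lucene applies can differ between versions.

```python
import math

def raw_tf_idf(tf, df, n_docs):
    """Textbook tf-idf for one term in one document (assumed definition)."""
    return tf * math.log(n_docs / df)

def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 score for one term in one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: the fraction approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + norm)
```

Raw tf-idf grows linearly with the term frequency, whereas the BM25 term score saturates and also depends on document length, which is most of the "additional fluff" in practice.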

Custom scoring is likely to be the right tool, though. What put you off it?

All the examples seemed to be tied to individual fields, and there didn't appear to be any way to extract the overall document frequency of each term. I would love to find a way to do this, though.
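One possible workaround over plain REST, offered only as a sketch: the term vectors API (`_termvectors`) can return `term_freq` and `doc_freq` per term when `term_statistics` is set to `true`, along with a `doc_count` in the field statistics. The Python below computes tf-idf client-side from a response of that shape. The helper name and the sample response are made up for illustration, the formula is assumed to be the textbook `tf * ln(N / df)`, and the statistics are per field and may be computed per shard, so this is not equivalent to scoring inside the engine.

```python
import math

def tf_idf_from_term_vectors(response, field):
    """Compute tf * ln(N / df) per term from a term-vectors-style response.

    `response` is assumed to be the parsed JSON of a _termvectors call made
    with term_statistics=true and field statistics enabled.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]
    scores = {}
    for term, stats in field_data["terms"].items():
        scores[term] = stats["term_freq"] * math.log(n_docs / stats["doc_freq"])
    return scores

# Hand-written sample in the shape of a term vectors response (made-up numbers).
sample_response = {
    "term_vectors": {
        "body": {
            "field_statistics": {"doc_count": 1000, "sum_doc_freq": 52000, "sum_ttf": 90000},
            "terms": {
                "lucene": {"term_freq": 3, "doc_freq": 10, "ttf": 40},
                "the": {"term_freq": 12, "doc_freq": 1000, "ttf": 15000},
            },
        }
    }
}

scores = tf_idf_from_term_vectors(sample_response, "body")
```

This gives per-document term weights rather than a ranked query result, so ranking a large index this way would still need one request per candidate document.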

See the new advanced scripting docs, which describe how to create a script engine. A script engine has access to Lucene internals, so you could get the raw tf and idf values.



Hi Ryan - is there a way to do similar scripting purely via REST? I have to use a remote Elasticsearch client, so I don't have the ability to add a plugin.

There is no other way to gain access to Lucene internals.
