Scripted Similarity performance


(Barry Woods) #1

We have previously been using a custom similarity plugin to meet the scoring needs of our application. It was pretty simple as all I wanted out of it was to ignore TF/IDF so I could compare scores across indices.

In the version 6 release, I noticed there is a new option for scripted similarities. I believe this was added for people who want to keep using the soon-to-be-deprecated default TF/IDF similarity. However, I can see it suits my needs perfectly as well.

I would very much like to get rid of the custom similarity plugin, as it is a bit painful having to maintain it with each new release. Before I switch to the new scripted similarity, I wanted to know whether there would be any performance difference between them. As annoying as the plugin is to deal with, I wouldn't want to drop it if scoring with a script is slower.

Also, are there plans to add better API support for scripted similarities in Nest?

Thanks


(Shane Connelly) #2

The only way to know for sure how the two compare on performance is to benchmark them. We've spent a lot of time making Painless very fast. It plugs into native Elasticsearch functionality while being less tightly bound to specific versions, which avoids the plugin maintenance pain it sounds like you've run into. I would generally recommend making an effort to move away from plugins if you can.

Can you talk a bit more about what you want to do with scripted similarities in Nest?


(Barry Woods) #3

Sure. Our application has a few different data types (documents with metadata, contacts, meetings, discussions, etc.) and a global search page that searches across all of them. Because they have different fields, they are all in different indices, but we need to be able to compare them against each other to bring the most relevant results to the top.

To do that, we apply a boost to each field for text searches (e.g. title fields get a boost of 10, whereas notes get a boost of 5), and all other filters get a constant score of 0.

The first time we did that, we realised that the scores weren't comparable across indices because of the query norm and TF/IDF. That is why we brought in the plugin, to hard-code all of those values to 1. This is what it looks like:

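import org.apache.lucene.search.similarities.ClassicSimilarity;

// Similarity that pins TF, IDF and the query norm to 1 so that
// they no longer influence the score.
public class FixedSimilarity extends ClassicSimilarity {

    // Ignore how rare the term is in the index.
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1.0f;
    }

    // Ignore how often the term occurs in the document.
    @Override
    public float tf(float freq) {
        return 1.0f;
    }

    // Disable query normalisation.
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f;
    }
}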
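
To illustrate why the stock formulas make scores hard to compare between indices, here is a small self-contained sketch. It does not use Lucene; it only re-implements the classic TF (square root of the frequency) and IDF (1 + ln((docCount + 1) / (docFreq + 1))) formulas, and the per-term score is deliberately simplified (norms, the query norm and any repeated IDF factor are left out). The class name, method names and index statistics are all made up for the example.

```java
public class SimilarityComparison {

    // Classic TF: square root of the term frequency in the document.
    public static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    // Classic IDF: 1 + ln((docCount + 1) / (docFreq + 1)).
    public static double classicIdf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    // Simplified per-term score: boost * tf * idf.
    // Assumption: norms and query normalisation are ignored here.
    public static double classicScore(double boost, double freq, long docFreq, long docCount) {
        return boost * classicTf(freq) * classicIdf(docFreq, docCount);
    }

    // Same idea as FixedSimilarity: tf and idf pinned to 1,
    // so only the boost is left in the score.
    public static double fixedScore(double boost) {
        return boost * 1.0 * 1.0;
    }

    public static void main(String[] args) {
        double titleBoost = 10.0;

        // Hypothetical statistics for the same term in two different indices.
        double inDocuments = classicScore(titleBoost, 1, 50, 1_000);
        double inContacts = classicScore(titleBoost, 1, 5, 200);

        System.out.println("classic, documents index: " + inDocuments);
        System.out.println("classic, contacts index:  " + inContacts);
        System.out.println("fixed, either index:      " + fixedScore(titleBoost));
    }
}
```

With the classic formulas the same title match scores differently in each index because the index-level statistics differ, whereas the fixed variant depends only on the boost.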

I think I will be able to replace that with the scripted similarity with a script of:

return query.boost;
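
For reference, a minimal sketch of how a script like this might be declared in the index settings on 6.x and attached to a field. The similarity name `boost_only`, the mapping type `doc` and the field `title` are placeholders chosen for the example, not names from our application; this would be the body of the index creation request.

```json
{
  "settings": {
    "similarity": {
      "boost_only": {
        "type": "scripted",
        "script": {
          "source": "return query.boost;"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "similarity": "boost_only"
        }
      }
    }
  }
}
```

Each text field that should ignore TF/IDF would need to reference the similarity in its mapping in the same way.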

If there is a better way to do what I am doing, I am open to alternatives.

Thanks

