I am in the process of upgrading Elasticsearch and NEST from 1.7 to 2.3 and I am having some issues with the changes to the API. Most of it is fine, just a case of renaming things or removing some filter clauses, but what is causing me serious problems is that the Similarity method used when creating an index no longer has an option for specifying a custom similarity.
We need to be able to set the QueryNorm to a fixed value as we do cross index searching as a part of the general search on the app and then those results need to be ranked against each other. We do this by giving up on TF/IDF and just assigning constant scores based on which fields the text matches in (so matching in a title gives you a higher rank than the description). E.g. when doing a global search, a contact that matches perfectly in the title should have the same score as a meeting that matches perfectly in the title.
To achieve that, we wrote our own custom similarity and hooked it up like this:
client.CreateIndex(x => x.Index(newIndexName).AddAlias(GetWriteAlias())
    .Similarity(s => s.CustomSimilarities(cs => cs.Add("default", new CustomSimilarity()))));
Now, however, I cannot see how to do that.
Is it still possible to do this in the latest version of Elasticsearch and NEST, or have custom similarities been quietly deprecated?
Even better, would it be possible to do what I am trying to do without needing a custom similarity?
Mind sharing your CustomSimilarity here? Is it one you've written in Java as a plugin, or are you simply configuring one of the built-in similarities that Elasticsearch provides out of the box?
If it's the latter, which is the more typical use case, then you just need to use the respective fluent method for that similarity.
For instance, configuring a "custom" BM25 similarity would look like this:
client.CreateIndex("myindex", c => c
    .Similarity(s => s
        .BM25("default", bm => bm
            .K1(1.5)
            .B(0.8)
            .DiscountOverlaps()
        )
    )
);
If it's the former, then there unfortunately isn't a way to add a custom similarity using the fluent interface (we will open an issue for this), but you can with the object initializer syntax:
client.CreateIndex(new CreateIndexRequest("myindex")
{
    Similarity = new Similarities
    {
        { "default", new MySimilarity() }
    }
});
where MySimilarity implements ISimilarity:
public class MySimilarity : ISimilarity
{
    public string Type { get { return "my_similarity"; } }
}
But I would first see if it's possible to tweak one of the built-in similarities before rolling your own.
"Even better, would it be possible to do what I am trying to do without needing a custom similarity?"
There are a few queries that enable you to modify document scores; for instance, check out the Constant Score and Function Score queries. You may be able to leverage these rather than messing with similarities.
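As a rough illustration of the Constant Score idea, here is a minimal sketch assuming NEST 2.x. The SearchItem type, its Title and Description properties, and the boost values are hypothetical and only stand in for your own documents; each matching field contributes a fixed boost instead of a TF/IDF score. Note that this on its own does not change how the index similarity computes the query norm, so scores may still be scaled differently per index.

```csharp
using Nest;

// Hypothetical document type; the property names are assumptions for illustration.
public class SearchItem
{
    public string Title { get; set; }
    public string Description { get; set; }
}

public static class ConstantScoreExample
{
    // Builds a bool query whose clauses each contribute a fixed boost:
    // a title match is weighted higher than a description match.
    public static ISearchResponse<SearchItem> Search(IElasticClient client, string text)
    {
        return client.Search<SearchItem>(s => s
            .Query(q => q
                .Bool(b => b
                    .Should(
                        sh => sh.ConstantScore(cs => cs
                            .Filter(f => f.Match(m => m.Field(p => p.Title).Query(text)))
                            .Boost(2.0)),
                        sh => sh.ConstantScore(cs => cs
                            .Filter(f => f.Match(m => m.Field(p => p.Description).Query(text)))
                            .Boost(1.0))
                    )
                )
            )
        );
    }
}
```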
Thanks for the reply Greg. It is indeed a plugin that we wrote in Java.
I have actually tried the object initializer route also but it seems to be ignoring it. I put my fully qualified name in the Type property but afterwards my queries still had the queryNorm value applied. I even changed it to some random string and the index still built fine and nothing complained. The searches even return fine and have the same value as when it was the correct string. That suggests it is either ignoring it completely or I am defining my type wrong and if it cannot find it, it silently just uses the default similarity instead (which would be quite odd behaviour).
We initially tried putting ConstantScores around everything but the problem is that the queryNorm still gets applied and it is still different for every index.
If you do a GET your-index-name do you see your custom similarity with the correct values? Can you post the output here? Can you also share your C# implementation of the custom similarity and your create index call?
Type should be the name of your custom similarity in ES (i.e. the name you're using when calling module.addSimilarity()). In this case that would be mysimilarity.
Something else is strange here though, because ES would have returned an error and not have created the index if you passed in the wrong similarity name. Has that not been the case at all?
Can you also share the JSON that NEST generates for your create index call? You will have to call .DisableDirectStreaming() on your ConnectionSettings if you're not already. I want to make sure your similarity is getting serialized correctly.
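A minimal sketch of how the request JSON could be captured, assuming NEST 2.x and a node at http://localhost:9200 (the URI, the index name, and the trivial settings are placeholders; substitute your own create index call). It relies on ApiCall.RequestBodyInBytes, which is only populated when direct streaming is disabled.

```csharp
using System;
using System.Text;
using Nest;

public static class RequestJsonExample
{
    public static void Main()
    {
        // Assumed node address; DisableDirectStreaming() makes NEST buffer
        // the request and response bodies so they can be inspected afterwards.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DisableDirectStreaming();
        var client = new ElasticClient(settings);

        // Placeholder call: replace with the create index request that
        // configures your similarity.
        var response = client.CreateIndex("myindex", c => c
            .Settings(s => s.NumberOfShards(1)));

        // Print the JSON that was sent to Elasticsearch, if it was captured.
        var body = response.ApiCall.RequestBodyInBytes;
        if (body != null)
        {
            Console.WriteLine(Encoding.UTF8.GetString(body));
        }
    }
}
```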
Thanks Greg. I was actually putting it in as the fully qualified class name, which is how we used to do it before the upgrade but I have changed it to be the name registered in the plugin as you stated. However, it has still made no difference.
No, in the end we have just decided to abandon the upgrade. We wanted to do it so we could upgrade to .NET 4.6 and use some libraries that are only available in 4.5+ to make it easier to implement a new feature.
It seems the custom similarity stuff is just not working at the moment but I also know that using them is generally discouraged by the team so better support is probably unlikely.
I think we will probably have to wait until we have more time available so that we can totally refactor our global search feature so that we can store everything (or at least a simplified generic version of everything) in a single index to avoid the problem we are having with normalising scores.
It is just a shame that there is no way to do cross index scoring without having to hack the internals like we did with our custom similarity. I get that you can't do TF/IDF with it, but I thought it should be possible if you took scoring into your own hands with constants and functions like I have.
In the next 2.x and 5.x releases of NEST we will ship a CustomSimilarity that can survive roundtripping, i.e. it will serialize when putting settings and deserialize properly when getting index settings.
For example:
var createIndexRequest = new CreateIndexRequest(newIndexName)
{
    Aliases = new Aliases { { GetWriteAlias(), new Alias() } },
    Similarity = new Similarities
    {
        { "bm25", new BM25Similarity { B = 1.0, K1 = 1.1, DiscountOverlaps = true } },
        { "my_name", new CustomSimilarity("plugin_sim") {
            { "some_property", "some value" }
        } }
        //etc
    }
};
see:
On top of that we'll fix that you can set per field custom similarities in the mapping:
If you want to apply a new default globally, set the index.similarity.default.type cluster setting.