From experience I know I want the search to be sensitive to subwords. By adding an extensive wordlist I do get the subwords, but also undesired behaviour: if the subwords are frequent, or appear in two fields, they are scored higher than the full word. One solution would be to boost by token length, probably by the square of the length. Example:
I have Swedish texts. The word "fastland" (meaning mainland) gets analysed/tokenized to "fastland", "fast", "land". A text where the word "Finland" appears 5 times gets a higher score than a text where the word "fastland" appears once.
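To make the tokenization concrete, this is roughly how it can be checked with the _analyze API (a sketch; my_index and the analyzer name swedish_decompound are placeholders for my actual names):

```json
POST my_index/_analyze
{
  "analyzer": "swedish_decompound",
  "text": "fastland"
}
```

The response lists the tokens fastland, fast and land. "Finland" presumably yields a land token as well, which is why five occurrences of it outscore a single fastland.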
Which Java classes should I subclass to apply a search-time boost to an individual token?
Is a bit of scripting in Painless the right way to go, i.e. does ctx or some other parameter contain the matching tokens?
I notice very little interest in this, so I'll publish my solution, make some comments, and then mark it as solved.
The solution was to use different analyzers for indexing and for searching. The important difference is that the dictionary decompounder is applied only while indexing. How to do this is well documented here.
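For reference, a minimal sketch of the kind of settings and mapping I mean (index, field and analyzer names are placeholders, the word list only holds the example words, and the mapping syntax is the one without mapping types):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["fast", "land", "fin"]
        }
      },
      "analyzer": {
        "swedish_index": {
          "tokenizer": "standard",
          "filter": ["lowercase", "swedish_decompounder"]
        },
        "swedish_search": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "swedish_index",
        "search_analyzer": "swedish_search"
      }
    }
  }
}
```

With this, a document containing "fastland" is indexed with the extra tokens fast and land, but a query for "fastland" is analysed without decompounding, so it no longer also matches every document that merely contains "land".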
I made some attempts with the Painless language. My observations are the following.
Not enough variables are available in the function score callback; at least the variables appearing in the Explanation should be available. I would also have wanted the matching tokens. See the sketch after these points.
The Debug class offers too little, and having its one method work by throwing an exception is an unpleasantly surprising choice to me.
The Groovy language should be available without having to plug it in.
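As an illustration of the first point, this is roughly the shape of the function score script I experimented with (a sketch; index, field and the boost factor are made up). Inside such a script only things like _score, doc values and params are reachable, not the matching tokens or the details shown in the Explanation:

```json
GET my_index/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "text": "fastland" } },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "_score * params.factor",
          "params": { "factor": 1.2 }
        }
      }
    }
  }
}
```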