I have been looking at the search algorithm used in Elasticsearch and
found the following set of rules, which are applied when calculating the
score (Boolean model):
1. More occurrences of a term in the document are preferred.
2. Terms that are rarer in the corpus are preferred.
3. Shorter documents are weighted more heavily.
4. Other functions (boosts, etc.) are used to adjust the score.
In my application we run text-based searches across a set of Word
documents. We would like to assign a higher score to documents with more
occurrences and show them at the top, irrespective of document size.
Our application is primarily a recruitment system where search is based
on skill sets, so our business team wants resumes with more occurrences of
the search keywords shown at the top, irrespective of size and rare terms.
Is there any mechanism to ignore the second and third rules listed above
and calculate the score based on the "more occurrences" condition only? We
are executing search operations using the Java API. Please let me know
whether this is possible and, if so, how.
For the third rule, you can omit index norms for a field, which will
prevent length normalization. See [1]. The option is called either
omit_norms or norms.enabled, depending on your version.
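To make the two option names concrete, here is a rough sketch in plain Java that only builds the mapping JSON as a string. It uses no Elasticsearch classes. The type name "resume" and the field name "skills" are made up for illustration, and the exact option your cluster accepts depends on its version, so treat this as an assumption to check against your own setup.

```java
// Sketch only: builds a field mapping as a JSON string.
// "resume" and "skills" below are hypothetical names, not from the thread.
class NormsMappingSketch {

    // legacyOption = true  -> older style:  "omit_norms": true
    // legacyOption = false -> newer style:  "norms": {"enabled": false}
    static String mapping(String type, String field, boolean legacyOption) {
        String normsSetting = legacyOption
                ? "\"omit_norms\": true"
                : "\"norms\": {\"enabled\": false}";
        return "{\"" + type + "\": {\"properties\": {\"" + field + "\": {"
                + "\"type\": \"string\", " + normsSetting + "}}}}";
    }

    public static void main(String[] args) {
        // Print both variants so they can be compared side by side.
        System.out.println(mapping("resume", "skills", false));
        System.out.println(mapping("resume", "skills", true));
    }
}
```

The resulting string would be handed to whatever put-mapping call your Java client version offers. Norms already written for existing documents are not removed by a mapping change alone, so reindexing is likely needed.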
For the second rule, it is slightly more complicated. You can define your
own custom similarity [2] that dictates how the TF, IDF and norms are used.
You simply extend Lucene's DefaultSimilarity (or TFIDFSimilarity) and add
it to Elasticsearch's classpath.
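To show what such a similarity would change, here is a small stand-alone sketch in plain Java. It does not extend Lucene's DefaultSimilarity; it only mirrors the per-term factors that class documents (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(numTerms)) and leaves out query normalization, coord, boosts and the lossy one-byte encoding of norms. All class and method names, and the sample numbers, are made up for illustration.

```java
// Sketch only: mirrors the classic TF-IDF factors in plain Java, no Lucene classes.
class TfOnlyScoringSketch {

    // Term frequency factor: square root of the occurrences in the document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger value.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Length normalization: longer fields get a smaller value.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified classic per-term score: tf * idf^2 * norm.
    static double classicScore(int freq, long docFreq, long numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    // What is left when idf is fixed to 1 and norms are omitted.
    static double tfOnlyScore(int freq) {
        return tf(freq);
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 1000 resumes, 10 of which mention the skill.
        // Short resume: 2 occurrences in 100 terms. Long resume: 9 in 10000 terms.
        System.out.println("classic short: " + classicScore(2, 10, 1000, 100));
        System.out.println("classic long:  " + classicScore(9, 10, 1000, 10000));
        System.out.println("tf-only short: " + tfOnlyScore(2));
        System.out.println("tf-only long:  " + tfOnlyScore(9));
    }
}
```

With these made-up numbers the short resume wins under the classic formula, while the long resume with more occurrences wins once idf and norms are taken out, which is the ordering asked for in the question. A real custom similarity would do the equivalent by overriding the idf behaviour in a DefaultSimilarity subclass.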
Hi Ivan,
Thanks for the reply. We tried the norms.enabled property and it works
fine. However, we have observed that this attribute works only on string
types. In our application we index Word (.doc, .docx) and PDF documents
and perform text-based search on the document content. When we define
norms.enabled for attachment types, it has no effect: length
normalization still happens and the size of the document is still
considered when calculating the score.
Please suggest how to resolve this issue for attachment types.
[Quoted text trimmed: Ivan Brusic's reply of Monday, 27 January 2014
23:50:41 UTC+5:30, and Hiro Gangwani's original post of Sun, Jan 26, 2014
at 11:12 PM.]
Norms are applied at the field level, not at the index level. You would
need to omit norms on every field where length normalization should not
apply. Another alternative would be to use index templates.
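As a rough idea of what the template route could look like, the sketch below builds a template body as a JSON string in plain Java. The layout (a "template" index pattern plus a dynamic template under "_default_" that matches string fields) is my understanding of the template format from around the time of this thread, so please verify it against the documentation for your version. The template entry name "strings_without_norms" and the index pattern "resumes-*" are invented for the example.

```java
// Sketch only: builds an index template body as a JSON string, no Elasticsearch classes.
class NoNormsTemplateSketch {

    // indexPattern is a hypothetical pattern such as "resumes-*".
    static String templateBody(String indexPattern) {
        return "{"
                + "\"template\": \"" + indexPattern + "\", "
                + "\"mappings\": {\"_default_\": {\"dynamic_templates\": [{"
                // Applies to every dynamically mapped string field.
                + "\"strings_without_norms\": {"
                + "\"match_mapping_type\": \"string\", "
                + "\"mapping\": {\"type\": \"string\", \"norms\": {\"enabled\": false}}"
                + "}}]}}"
                + "}";
    }

    public static void main(String[] args) {
        System.out.println(templateBody("resumes-*"));
    }
}
```

A template like this only affects indices created after it is registered, and a dynamic template only covers fields that are mapped dynamically, so fields with an explicit mapping still need the norms setting on each of them.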
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.