Modifying scoring algorithm during search operations

Dear Team,

I have been looking at the search algorithm used in Elasticsearch and
found the following set of rules, which are applied when calculating the score
(Boolean Model):

  • more occurrences in the document are preferred
  • terms rarer in the corpus are preferred
  • shorter documents are more heavily weighted
  • other functions used to adjust score, boosts, etc.

In my application we are doing text-based search across a set of Word
documents. We would like to assign a higher score to documents having
more occurrences and show them at the top, irrespective of document size.
Primarily, our application is a recruitment system where search is based
on skill sets. So our business team wants to show the resumes having more
occurrences of the search keywords at the top, irrespective of size and rare
terms. Is there any mechanism to ignore the second and third rules listed
above and calculate the score based on the "more occurrences" condition only?
We are executing search operations using the Java API. Please let me know
whether it is possible to achieve this and, if yes, how.

Thanks in advance for suggesting a solution.

Hiro

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f6936b6f-ef7c-4497-b186-bdba28176d89%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

For the third rule, you can omit index norms for a field, which will prevent
length normalization. See [1]. The option is called either omit_norms
or norms.enabled, depending on your version.

For the second rule, it is slightly more complicated. You can define your
own custom similarity [2] that dictates how the TF, IDF and norms are used.
You simply extend Lucene's DefaultSimilarity (or TFIDFSimilarity) and add
it to Elasticsearch's classpath.
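
To make the effect of the two rules concrete, here is a minimal plain-Java sketch of classic TF-IDF term scoring. It is a model for illustration only, not the Lucene API: the class and method names are hypothetical, the formulas are the commonly cited classic ones (square-root term frequency, logarithmic IDF, inverse square-root length norm), and query normalization, coordination and boosts are left out. In a real custom similarity, the assumption is that you would get the same effect by overriding the idf and lengthNorm methods of the Lucene similarity class so that they return a constant.

```java
import java.util.Locale;

// Plain-Java model of classic TF-IDF term scoring; not the Lucene API.
final class TermScoreSketch {

    // Term frequency factor: grows with the number of occurrences in the document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms that are rarer in the corpus get a larger factor.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Length normalization: fields with fewer terms get a larger factor.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Classic-style score for a single term, with all three rules applied.
    static double classicScore(int freq, long docFreq, long numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    // Occurrence-only score: the idf and length-norm factors are pinned to 1.
    static double tfOnlyScore(int freq) {
        return tf(freq) * 1.0 * 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 1000 resumes, the term appears in 100 of them.
        // Long resume: 9 occurrences in 4000 terms. Short resume: 4 in 400 terms.
        System.out.printf(Locale.ROOT, "classic  long=%.4f short=%.4f%n",
                classicScore(9, 100, 1000, 4000), classicScore(4, 100, 1000, 400));
        System.out.printf(Locale.ROOT, "tf-only  long=%.4f short=%.4f%n",
                tfOnlyScore(9), tfOnlyScore(4));
    }
}
```

With these made-up numbers the classic score ranks the short resume first even though it has fewer occurrences, while the occurrence-only score ranks the long resume first, which is the ordering asked for in the question.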

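[1] Elasticsearch documentation: reference/current/mapping-core-types.html#string

[2] Elasticsearch documentation: reference/current/index-modules-similarity.html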

--
Ivan

On Sun, Jan 26, 2014 at 11:12 PM, Hiro Gangwani <hiro.gangwani@gmail.com> wrote:

> Dear Team, [...]

Hi Ivan,
Thanks for the reply. We tried using the norms.enabled property and it is
working fine. But what we have observed is that this attribute works only on
string types. In our application we are indexing Word (.doc, .docx) and
PDF documents and performing text-based search on the document content. When
we define norms.enabled for attachment types, disabling normalization does
not work and the size of the document is still considered when calculating
the score.

Please suggest how to resolve this issue for attachment types.

Code to create the index for attachment types:

XContentBuilder map = XContentFactory.jsonBuilder().startObject()
    .startObject(idxType)
        .startObject("properties")
            .startObject("file")
                .field("type", "attachment")
                // norms are disabled on the attachment field itself
                .field("norms.enabled", false)
                .startObject("fields")
                    .startObject("refid")
                        .field("store", "yes")
                    .endObject()
                    .startObject("name")
                        .field("store", "yes")
                    .endObject()
                    .startObject("itexp")
                        .field("store", "yes")
                    .endObject()
                    .startObject("totalexp")
                        .field("store", "yes")
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    .endObject()
.endObject(); // closes the root object opened by the first startObject()

Hiro

On Monday, 27 January 2014 23:50:41 UTC+5:30, Ivan Brusic wrote:

> For the third rule, [...]

Norms are applied at the field level, not at the index level. You would
need to omit norms on every field where length normalization should not
apply. Another alternative would be to use index templates.
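
As a rough illustration of the per-field point only (it does not cover index templates), the sketch below builds a mapping fragment in which every listed string field carries its own norms setting. Everything here is an assumption for illustration: the class name, the method name and the field names are hypothetical, the JSON is assembled by hand rather than with XContentBuilder, field names are assumed to need no JSON escaping, and the norms/enabled form is the newer of the two option names mentioned earlier in this thread (older versions would use omit_norms instead). Whether the sub-fields of an attachment field honor this setting should be checked against your plugin version.

```java
import java.util.Arrays;
import java.util.List;

// Builds a mapping fragment by hand; a real client would likely use XContentBuilder.
final class NormsMappingSketch {

    // Returns a "properties" JSON object with norms disabled on every listed string field.
    static String propertiesWithoutNorms(List<String> fieldNames) {
        StringBuilder json = new StringBuilder("{\"properties\":{");
        for (int i = 0; i < fieldNames.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            // Each field gets its own norms setting; there is no index-wide switch here.
            json.append('"').append(fieldNames.get(i)).append("\":")
                .append("{\"type\":\"string\",\"norms\":{\"enabled\":false}}");
        }
        return json.append("}}").toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names, loosely based on the mapping in the question.
        System.out.println(propertiesWithoutNorms(Arrays.asList("name", "skills")));
    }
}
```

The point of the loop is that the setting is repeated once per field, so any field left out of the list keeps its default length normalization.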

--
Ivan

On Mon, Jan 27, 2014 at 10:32 PM, Hiro Gangwani <hiro.gangwani@gmail.com> wrote:

> Hi Ivan, [...]