Scoring longest continuous string in documents


(Alexis Berger) #1

Hi,

I have a dataset containing about 15M documents.
Each document has a single field containing a short text of at most 50 words.

I ran many test queries until I found one that does what I need, handling both stop words and missing words.
It is a function score query wrapping a common terms query, with a function whose filter is a phrase match query:
(for this example, let's say the cutoff_frequency value is right and skips stop words correctly)

GET xxxx/_search
{
  "fields": ["myField"],
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "common": {
              "myField": {
                "query": "word1 word2 word3 word4",
                "cutoff_frequency": 0.149,
                "low_freq_operator": "or",
                "high_freq_operator": "or",
                "minimum_should_match": 3
              }
            }
          }
        }
      },
      "functions": [
        {
          "filter": {
            "query": {
              "match": {
                "oneField.normal": {
                  "query": "word1 word2 word3 word4",
                  "type": "phrase",
                  "slop": 0
                }
              }
            }
          },
          "weight": 1.0
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

Everything works well except the document ranking.
The default similarity does not seem to fit my needs. I need the highest scores to go to documents containing the longest continuous string of query words (the most adjacent words).
What is the best way to achieve this ranking? A different similarity? Maybe I missed a special query in the documentation...
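
To make the desired ranking concrete, the criterion can be sketched in plain Python. This is only an illustration of the scoring idea, not Elasticsearch code. The function name `longest_adjacent_run`, the sample documents, and the simple lowercase whitespace tokenization are all assumptions made for the example. It also assumes that adjacent words must appear in the same order as in the query.

```python
def longest_adjacent_run(query: str, document: str) -> int:
    """Return the length of the longest run of consecutive query words
    that also appear consecutively, in the same order, in the document."""
    # Assumption: naive lowercase + whitespace tokenization, no analyzer.
    q = query.lower().split()
    d = document.lower().split()
    best = 0
    # prev[j] holds the length of the common run ending at q[i-2], d[j-1].
    prev = [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        curr = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                # Extend the run that ended on the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sample data, reusing the placeholder words from the query.
query = "word1 word2 word3 word4"
docs = [
    "word1 word2 other word3 word4",   # longest run: 2
    "word1 other word2 word3 word4",   # longest run: 3
    "word4 word3 word2 word1",         # longest run: 1
]

# Documents with the longest adjacent run come first.
ranked = sorted(docs, key=lambda doc: longest_adjacent_run(query, doc), reverse=True)
for doc in ranked:
    print(longest_adjacent_run(query, doc), doc)
```

Under these assumptions, a document that matches three query words in a row ranks above one that matches all four words with a gap in the middle, which is the ordering the default similarity does not produce.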

Any help/advice would be much appreciated!
Thanks

BTW I am using ES 1.5 :)

