Hi Ivan,
Since I am not sure how an analyzer with stopwords can be set in the query
itself, I tried to set stopwords="none" via the index settings and the
type mapping:
*Index settings: *
{
"jdbc_dev": {
"settings": {
"index.analysis.analyzer.string_lowercase.filter": "lowercase",
"index.number_of_replicas": "1",
"index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
"index.number_of_shards": "5",
"index.version.created": "900199",
* "index.analysis.analyzer.standard.type": "standard",*
Type Mapping :
{
"media": {
"properties": {
"AUDIO": {
"type": "string"
},
....
"DISPLAY_NAME": {
"type": "string",
* "analyzer": "standard"*
},
....
}
}
*Query : *
/media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
{
"from" : 0,
"size" : 100,
"explain" : true,
"query" : {
"filtered" : {
"query" : {
"multi_match": {
"query": "happy",
"fields": [ "DISPLAY_NAME" ]
}
},
"filter" : {
"query" : {
"bool" : {
"must" : {
"term" : {
"CHANNEL_ID" : "1"
}
}
}
}
}
}
}
}
*Result : *
"_shard": 4,
"_node": "xsGVhtTnThaG57_mJdMtxg",
"_index": "jdbc_dev",
"_type": "media",
"_id": "127413",
"_score":* 6.614289*,
"_source": {
"DISPLAY_NAME": "Be Happy",
"_explanation": {
"value": 6.614289,
"description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
"details": [
{
"value": 6.614289,
"description": "fieldWeight in 6485, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.582862,
"description": "idf(docFreq=93, maxDocs=1364306)"
},
{
"value": 0.625,
"description": "fieldNorm(doc=6485)"
}
]
}
]
}
"_shard": 4,
"_node": "UOjX2lxhR6mzfjHHmTm3cQ",
"_index": "jdbc_dev",
"_type": "media",
"_id": "72253",
"_score": 6.614289,
"_source": {
"DISPLAY_NAME": "Happy Ways",
"_explanation": {
"value": 6.614289,
"description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
"details": [
{
"value": 6.614289,
"description": "fieldWeight in 1102, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.582862,
"description": "idf(docFreq=93, maxDocs=1364306)"
},
{
"value": 0.625,
"description": "fieldNorm(doc=1102)"
}
]
}
]
}
"_shard":* 4*,
"_node": "UOjX2lxhR6mzfjHHmTm3cQ",
"_index": "jdbc_dev",
"_type": "media",
"_id": "127413",
"_score": 6.614289,
"_source": {
"DISPLAY_NAME": "Be Happy",
"_explanation": {
"value": 6.614289,
"description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
"details": [
{
"value": 6.614289,
"description": "fieldWeight in 7277, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.582862,
"description": "idf(docFreq=93, maxDocs=1364306)"
},
{
"value": 0.625,
"description": "fieldNorm(doc=7277)"
}
]
}
]
}
Notice that items 1, 2 and 3 all have the same score, 6.614289, even though
the DISPLAY_NAME values are different:
- Be Happy
- Happy Ways
- Be Happy
It looks like the number of characters (the length of the field) is not taken
into consideration when the score is computed. I remember reading somewhere in the
documentation that, by default, the algorithm should give a higher score to
documents that have shorter text in the searched field; however, this
doesn't seem to be the case. Also, I didn't manually disable norms.
Any suggestions on how I could work around this issue?
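As a worked check, the numbers in the explain output above can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), which agrees with the idf values printed in the explanations; the function names are made up for illustration. Note that the length norm is based on the number of terms in the field, not the number of characters, and "Be Happy" and "Happy Ways" both contain two terms, which would explain the identical fieldNorm of 0.625.

```python
import math

def tf(freq):
    # Classic Lucene term-frequency factor: square root of the raw count.
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # Classic Lucene inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # "fieldWeight" in the explain output is the product of the three factors.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm

# Values copied from the explain output: termFreq=1.0, docFreq=93,
# maxDocs=1364306, fieldNorm=0.625.
score = field_weight(freq=1.0, doc_freq=93, max_docs=1364306, field_norm=0.625)
print(f"idf   = {idf(93, 1364306):.6f}")
print(f"score = {score:.6f}")
```

With a single query term and termFreq=1.0 everywhere, the only factor that can separate these documents is fieldNorm, so titles with the same number of terms end up tied.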
On Sat, Apr 5, 2014 at 12:39 PM, chee hoo lum cheehoo84@gmail.com wrote:
Hi Ivan,
I am trying to disable stopwords, and I am using an ES version earlier than 1.0. The
following is the query I ran:
{
"explain": true,
"query": {
"match_phrase": {
"DISPLAY_NAME": {
"query": "happy",
"operator": "and",
* "analyzer": { "stop" : { "type":"stop", "stopwords" : "none"
}}*
}
}
}
}
However, it throws this error:
"error": "SearchPhaseExecutionException[Failed to execute phase [dfs],
total failure; shardFailures {[kr37FCksStOKW5ZCo6PCwQ][jdbc_dev][0]:
SearchParseException[[jdbc_dev][0]: from[-1],size[-1]: Parse Failure
[Failed to parse source [{\n "explain": true,\n "query": {\n
"match_phrase": {\n "DISPLAY_NAME": {\n "query":
"happy",\n "operator": "and",\n "analyzer": {
"stop" : { "type":"stop", "stopwords" : "none" }}\n }\n
}\n }\n}]]]; nested: QueryParsingException[[jdbc_dev] [match] query does
not support [stopwords]]; }{[VYQt633MTUuJdAwL--PE3A][jdbc_dev][1]:
May I know how to properly include stopwords = "none" in the query, or
is this unavailable in ES versions prior to 1.0?
I can't find any relevant information in the documentation. Thanks.
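The error message ("[match] query does not support [stopwords]") suggests that the `analyzer` parameter of a match query expects the name of an analyzer, not an inline definition. A minimal sketch of that approach follows, written as Python dicts that serialize to the JSON request bodies. The analyzer name `standard_no_stop` is hypothetical, and the `_none_` value for disabling stopwords is an assumption that should be checked against the documentation for the ES version in use.

```python
import json

# Index settings: a named analyzer with stopword removal turned off.
# "standard_no_stop" is a hypothetical name; "_none_" is assumed to be
# the special value that disables the stopword list.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "standard_no_stop": {"type": "standard", "stopwords": "_none_"}
            }
        }
    }
}

# Query: refer to the analyzer by name only, no inline definition.
search_body = {
    "explain": True,
    "query": {
        "match_phrase": {
            "DISPLAY_NAME": {"query": "happy", "analyzer": "standard_no_stop"}
        }
    },
}

print(json.dumps(index_settings, indent=2))
print(json.dumps(search_body, indent=2))
```

A search-time analyzer alone would not bring back terms that were already dropped at index time, so the DISPLAY_NAME mapping would presumably need the same analyzer and the documents would need to be reindexed.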
On Fri, Apr 4, 2014 at 10:11 PM, Ivan Brusic ivan@brusic.com wrote:
The number of shards only affects the inverse document frequency. Items
such as the norm are document specific and are not affected by the number
of shards.
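To illustrate the point about shards and idf, here is a small sketch using the two (docFreq, maxDocs) pairs that appear in the explain output of the boosted query elsewhere in this thread. It assumes the classic Lucene formula idf = 1 + ln(maxDocs / (docFreq + 1)); the labels "shard A" and "shard B" are placeholders, since that output does not show which shard each hit came from, and the combined figure uses only these two sets of statistics.

```python
import math

def idf(doc_freq, max_docs):
    # Classic Lucene inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# (docFreq, maxDocs) pairs copied from the explain output in this thread.
shard_stats = {"shard A": (58, 1249243), "shard B": (80, 1321663)}

# Without a DFS phase each shard scores with its own local statistics.
per_shard = {name: idf(df, n) for name, (df, n) in shard_stats.items()}

# dfs_query_then_fetch first gathers term statistics from the shards and
# scores with the combined values, so every hit sees the same idf.
total_df = sum(df for df, _ in shard_stats.values())
total_docs = sum(n for _, n in shard_stats.values())
global_idf = idf(total_df, total_docs)

for name, value in per_shard.items():
    print(f"{name}: idf = {value:.6f}")
print(f"combined: idf = {global_idf:.6f}")
```

The per-shard values match the two different scores reported for otherwise identical "Happy" documents, while tf and fieldNorm are unaffected by where a document lives.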
I did not notice it before, but the document with DISPLAY_NAME of "Be
Happy" is probably scoring the same as the others because "Be" is a stop
word and therefore removed from the index. You end up matching Happy with
Happy, which is the same as the other documents.
Try using an analyzer without stopwords. Query tuning is hard work.
Cheers,
Ivan
On Fri, Apr 4, 2014 at 2:46 AM, chee hoo lum cheehoo84@gmail.com wrote:
Hi,
I discovered that the score values are influenced by the shards and nodes
where the documents are stored.
I therefore specified the preference and search_type in the search query;
however, I still have no idea how to get the result I want.
*The query : *
/media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
{
"from" : 0,
"size" : 100,
"explain" : true,
"query" : {
"filtered" : {
"query" : {
"multi_match": {
"query": "happy",
"fields": [ "DISPLAY_NAME" ]
}
},
"filter" : {
"query" : {
"bool" : {
"must" : {
"term" : {
"CHANNEL_ID" : "1"
}
}
}
}
}
}
}
}
*Results : *
*"_shard": 0,*
"_node": "kr37FCksStOKW5ZCo6PCwQ",
"_index": "jdbc_dev",
"_type": "media",
"_id": "27071",
"_score": 10.450976,
"_source": {
"DISPLAY_NAME": "Happy",
"PRICE": 1.5,
....
"_explanation": {
"value": 10.450976,
"description": "weight(DISPLAY_NAME:happy in 2210) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.450976,
"description": "fieldWeight in 2210, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.450976,
"description": "idf(docFreq=501, maxDocs=6385732)"
},
{
"value": 1,
"description": "fieldNorm(doc=2210)"
}
]
}
]
}
"_shard": 0,
"_node": "kr37FCksStOKW5ZCo6PCwQ",
"_index": "jdbc_dev",
"_type": "media",
"_id": "565689",
"_score": 10.450976,
"_source": {
"DISPLAY_NAME": "Be Happy",
"PRICE": 1.5,
....
"_explanation": {
"value": 10.450976,
"description": "weight(DISPLAY_NAME:happy in 10189) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.450976,
"description": "fieldWeight in 10189, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.450976,
"description": "idf(docFreq=501, maxDocs=6385732)"
},
{
"value": 1,
"description": "fieldNorm(doc=10189)"
}
]
}
]
}
*"_shard": 0,*
"_node": "kr37FCksStOKW5ZCo6PCwQ",
"_index": "jdbc_dev",
"_type": "media",
"_id": "425585",
"_score": 10.450976,
"_source": {
"DISPLAY_NAME": "Happy",
"PRICE": 4,
.....
"_explanation": {
"value": 10.450976,
"description": "weight(DISPLAY_NAME:happy in 10367) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.450976,
"description": "fieldWeight in 10367, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.450976,
"description": "idf(docFreq=501, maxDocs=6385732)"
},
{
"value": 1,
"description": "fieldNorm(doc=10367)"
}
]
}
]
}
},
It is weird that it returned the same score for every hit even though the
DISPLAY_NAME values are not the same. I didn't disable norms.
Does anyone have any idea?
On Thu, Apr 3, 2014 at 2:01 AM, chee hoo lum cheehoo84@gmail.com wrote:
Hi Ivan,
Nope, I didn't disable norms. Here's the mapping:
{
"media": {
"properties": {
"AUDIO": {
"type": "string"
},
"BILLINGTYPE_ID": {
"type": "long"
},
"CATMEDIA_CDATE": {
"type": "date",
"format": "dateOptionalTime"
},
"CATMEDIA_NAME": {
"type": "string"
},
"CATMEDIA_RANK": {
"type": "long"
},
"CAT_ID": {
"type": "long"
},
"CAT_NAME": {
"type": "string",
"analyzer": "string_lowercase",
"include_in_all": true
},
"CAT_PARENT": {
"type": "long"
},
"CHANNEL_ID": {
"type": "long"
},
"CKEY": {
"type": "long"
},
"DISPLAY_NAME": {
"type": "string"
},
"FTID": {
"type": "string"
},
"GENRE": {
"type": "string"
},
"ITEMCODE": {
"type": "string"
},
"KEYWORDS": {
"type": "string"
},
"LANG_ID": {
"type": "long"
},
"LONG_DESCRIPTION": {
"type": "string"
},
"MAPPINGS": {
"type": "string",
"analyzer": "string_lowercase",
"include_in_all": true
},
"MEDIA_ID": {
"type": "long"
},
"MEDIA_PKEY": {
"type": "string"
},
"PERFORMER": {
"type": "string"
},
"PLAYER": {
"type": "string"
},
"POSITION": {
"type": "long"
},
"PRICE": {
"type": "double"
},
"PRIORITY": {
"type": "long"
},
"SHORTCODE": {
"type": "string"
},
"SHORT_DESCRIPTION": {
"type": "string"
},
"TYPE_ID": {
"type": "long"
},
"VIEW_ID": {
"type": "long"
}
}
}
}
My client is nagging me about the relevancy of the results returned. You know how
business users always compare everything with Google search results and so on, lol. For
now I am scratching my head trying to sort this problem out. My use case is to search
by display_name and performer and show the closest possible matches at the
top of the list,
e.g.:
1)Happy
2)Happy
3)Be Happy
I would deeply appreciate it if you could shed some light on this. Thanks.
On Thu, Apr 3, 2014 at 1:51 AM, Ivan Brusic ivan@brusic.com wrote:
All the documents have the same score since they have the same field
weight, idf (always the same when you only have one search term) and term
frequency (each document has the term once).
It appears that you disabled norms on the DISPLAY_NAME field since
the field norm is 1. Is this correct? Can you provide the mapping? If you
disable norms, you will no longer get length normalization, which would
provide the ordering you desire since the field norms will penalize the
longer field, but it might not be ideal for every search. Relevancy
ultimately depends on you and your use cases. Another option is to enable
term vectors [1] (or index the number of terms yourself) and see if the
resulting field has the same number of tokens returned. Very kludgy.
[1]
Elasticsearch Platform — Find real-time answers at scale | Elastic
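To make the length-normalization point concrete, here is a rough sketch of how the fieldNorm values seen in explain output can arise. It assumes the default similarity computes 1/sqrt(numTerms) with no index-time boost and then stores it in a single byte with a 3-bit mantissa; the encode/decode functions are a from-memory Python port of that one-byte float scheme, so treat the details as an assumption, not a reference implementation.

```python
import math
import struct

def float_to_byte315(f):
    # Reinterpret the float32 bits as an int, keep the exponent plus the
    # top 3 mantissa bits, and shift the exponent range down.
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> (24 - 3)
    if small <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1   # underflow
    if small >= ((63 - 15) << 3) + 0x100:
        return 255                     # overflow
    return small - ((63 - 15) << 3)

def byte315_to_float(b):
    # Inverse of the encoding above; the dropped mantissa bits are lost.
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def field_norm(num_terms):
    # Length norm 1/sqrt(numTerms), then the lossy one-byte round trip.
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))

norms = {n: field_norm(n) for n in (1, 2, 3, 4)}
for n, value in norms.items():
    print(f"{n} term(s): fieldNorm = {value}")
```

Under these assumptions a one-term field gets a norm of 1 and a two-term field gets 0.625, so a fieldNorm of 1 on "Be Happy" is also what one would see if "be" were removed as a stopword at index time. Because the encoding is so coarse, some neighbouring lengths (three and four terms here) share the same norm.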
Cheers,
Ivan
On Wed, Apr 2, 2014 at 4:02 AM, chee hoo lum cheehoo84@gmail.com wrote:
Hi Binh,
The same problem again. I have the following query:
{
"from" : 0,
"size" : 100,
"explain" : true,
"query" : {
"filtered" : {
"query" : {
"multi_match": {
"query": "happy",
"fields": [ "DISPLAY_NAME^6", "PERFORMER" ]
}
},
"filter" : {
"query" : {
"bool" : {
"must" : {
"term" : {
"CHANNEL_ID" : "1"
}
}
}
}
}
}
}
}
However, results #2 and #3 are displayed in reverse order. I have
added a boost on DISPLAY_NAME, but it still yields the same behaviour:
*"_score": 10.960511,*
"_source": {
"DISPLAY_NAME": "Happy",
"PRICE": 5,
"CHANNEL_ID": 1,
"CAT_PARENT": 981,
"MEDIA_ID": 390933,
"GENRE": "Happy",
"MEDIA_PKEY": "838644",
"COMPOSER": null,
"PLAYER": null,
"CATMEDIA_NAME": "Happy",
"FTID": null,
"VIEW_ID": 43,
"POSITION": 51399,
"ITEMCODE": null,
"CAT_ID": 982,
"PRIORITY": 80,
"CKEY": 757447,
"CATMEDIA_RANK": 3,
"BILLINGTYPE_ID": 1,
"CAT_NAME": "POP",
"KEYWORDS": null,
"LONG_DESCRIPTION": null,
"SHORT_DESCRIPTION": null,
"TYPE_ID": 74,
"ARTIST_GENDER": null,
* "PERFORMER": "Mario Pacchioli",*
"MAPPINGS": "1_43_982_POP_981_51399_5",
"SHORTCODE": null,
"CATMEDIA_CDATE": "2014-01-12T15:12:27.000Z",
"LANG_ID": 1
},
"_explanation": {
"value": 10.960511,
"description": "max of:",
"details": [
{
"value": 10.960511,
"description": "weight(DISPLAY_NAME:happy^6.0 in 23025) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.960511,
"description": "fieldWeight in 23025, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.960511,
"description": "idf(docFreq=58, maxDocs=1249243)"
},
{
"value": 1,
"description": "fieldNorm(doc=23025)"
}
]
}
]
}
]
}
}
"_id": "10194",
* "_score": 10.699952,*
"_source": {
"DISPLAY_NAME": "Be Happy",
"PRICE": 1.5,
"CHANNEL_ID": 1,
"CAT_PARENT": 557,
"MEDIA_ID": 10194,
"GENRE": "Be Happy",
"MEDIA_PKEY": "534570",
"COMPOSER": null,
"PLAYER": null,
"CATMEDIA_NAME": "Be Happy",
"FTID": null,
"VIEW_ID": 241,
"POSITION": 6733,
"ITEMCODE": "33271",
"CAT_ID": 558,
"PRIORITY": 100,
"CKEY": 528380,
"CATMEDIA_RANK": 3,
"BILLINGTYPE_ID": 1,
"CAT_NAME": "POP",
"KEYWORDS": null,
"LONG_DESCRIPTION": null,
"SHORT_DESCRIPTION": null,
"TYPE_ID": 76,
"ARTIST_GENDER": null,
* "PERFORMER": "Mary J. Blige",*
"MAPPINGS": "1_241_558_POP_557_6733_1.5",
"SHORTCODE": "0012139471",
"CATMEDIA_CDATE": "2014-01-26T20:04:46.000Z",
"LANG_ID": 1
},
"_explanation": {
"value": 10.699952,
"description": "max of:",
"details": [
{
"value": 10.699952,
"description": "weight(DISPLAY_NAME:happy^6.0 in 9092) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.699952,
"description": "fieldWeight in 9092, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.699952,
"description": "idf(docFreq=80, maxDocs=1321663)"
},
{
"value": 1,
"description": "fieldNorm(doc=9092)"
}
]
}
]
}
]
}
},
*"_score": 10.699952,*
"_source": {
"DISPLAY_NAME": "Happy",
"PRICE": 1.5,
"CHANNEL_ID": 1,
"CAT_PARENT": 557,
"MEDIA_ID": 8615,
"GENRE": "Happy",
"MEDIA_PKEY": "533022",
"COMPOSER": null,
"PLAYER": null,
"CATMEDIA_NAME": "Happy",
"FTID": null,
"VIEW_ID": 241,
"POSITION": 5685,
"ITEMCODE": "11927",
"CAT_ID": 558,
"PRIORITY": 100,
"CKEY": 526838,
"CATMEDIA_RANK": 3,
"BILLINGTYPE_ID": 1,
"CAT_NAME": "POP",
"KEYWORDS": null,
"LONG_DESCRIPTION": null,
"SHORT_DESCRIPTION": null,
"TYPE_ID": 76,
"ARTIST_GENDER": null,
* "PERFORMER": "Ashanti",*
"MAPPINGS": "1_241_558_POP_557_5685_1.5",
"SHORTCODE": "0012139036",
"CATMEDIA_CDATE": "2014-01-26T20:03:44.000Z",
"LANG_ID": 1
},
"_explanation": {
"value": 10.699952,
"description": "max of:",
"details": [
{
"value": 10.699952,
"description": "weight(DISPLAY_NAME:happy^6.0 in 11167) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.699952,
"description": "fieldWeight in 11167, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 10.699952,
"description": "idf(docFreq=80, maxDocs=1321663)"
},
{
"value": 1,
"description": "fieldNorm(doc=11167)"
}
]
}
]
}
]
}
},
May I know how #2 and #3 can yield the same score even though
they have different text values? Also, how could I reverse #2 and
#3? I want the results returned by relevancy, so I assume
that they should
come back in this order:
1)Happy
2)Happy
3)Be Happy
Thanks.
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQD9aw%3Dh21OW_bJG4qbQ2TenQXa%2Bof8tgasVJqE16Bbysg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
--
Regards,
Chee Hoo