Query_string 'OR' doesn't work with keyword fields unless explicitly queried with OR

ES versions tested with: 6.5.4, 7.0.0.
Mapping:

{
 "index-name": {
   "aliases": {},
   "mappings": {
     "doc": {
       "dynamic": "false",
       "properties": {
         "content": {
           "type": "text"
         },
         "diffusion_id": {
           "type": "keyword"
         },
         "hostname": {
           "type": "keyword"
         },
         "imei": {
           "type": "keyword"
         },
         "imsi": {
           "type": "keyword"
         },
         "ip_dest": {
           "type": "keyword"
         },
         "ip_source": {
           "type": "keyword"
         },
         "isp_id_dest": {
           "type": "keyword"
         },
         "isp_id_source": {
           "type": "keyword"
         },
         "location": {
           "type": "geo_point"
         },
         "msisdn": {
           "type": "keyword"
         },
         "msisdn_dest": {
           "type": "keyword"
         },
         "msisdn_source": {
           "type": "keyword"
         },
         "protocol": {
           "type": "keyword"
         },
         "timestamp": {
           "type": "date"
         },
         "type": {
           "type": "keyword"
         },
         "url": {
           "type": "keyword"
         }
       }
     }
   },
   "settings": {
     "index": {
       "creation_date": "1560186019564",
       "number_of_shards": "4",
       "number_of_replicas": "1",
       "uuid": "iAe3BeX4QbSklZfjxSwNSw",
       "version": {
         "created": "6050499"
       },
       "provided_name": "cdrs"
     }
   }
 }
}

Query:

{
 "size": 50,
 "sort": [
   {
     "timestamp": {
       "order": "desc",
       "unmapped_type": "date"
     }
   }
 ],
 "aggs": {},
 "version": true,
 "query": {
   "bool": {
     "must": [
       {
         "match_all": {}
       },
       {
         "query_string": {
           "query": "0033568198782 0033698347370",
           "analyze_wildcard": true,
           "default_operator": "OR"
         }
       },
       {
         "range": {
           "timestamp": {
             "gte": 1396875747000,
             "lte": 1724558331000,
             "format": "epoch_millis"
           }
         }
       }
     ],
     "must_not": []
   }
 },
 "_source": {
   "excludes": []
 },
 "stored_fields": [
   "*"
 ],
 "script_fields": {},
 "docvalue_fields": [
   "timestamp"
 ],
 "highlight": {
   "pre_tags": [
     "@kibana-highlighted-field@"
   ],
   "post_tags": [
     "@/kibana-highlighted-field@"
   ],
   "fields": {
     "*": {
       "highlight_query": {
         "bool": {
           "must": [
             {
               "match_all": {}
             },
             {
               "query_string": {
                 "query": "0033568198782 0033698347370",
                 "analyze_wildcard": true,
                 "all_fields": true
               }
             },
             {
               "range": {
                 "timestamp": {
                   "gte": 1396875747000,
                   "lte": 1724558331000,
                   "format": "epoch_millis"
                 }
               }
             }
           ],
           "must_not": []
         }
       }
     }
   },
   "fragment_size": 2147483647
 }
}

Here, the field 'msisdn' doesn't match with this query (when the value is let's say: '0033568198782'):

     "query_string": {
       "query": "0033568198782 0033698347370",
       "analyze_wildcard": true,
       "default_operator": "OR"
     }

But this query matches both, just fine:

       {
         "query_string": {
           "query": "0033568198782 OR 0033698347370",
           "analyze_wildcard": true,
           "default_operator": "OR"
         }
       }

But in the docs, it's written:

The default operator used if no explicit operator is specified. For example, with a default operator of OR , the query capital of Hungary is translated to capital OR of OR Hungary , and with default operator of AND , the same query is translated to capital AND of AND Hungary . The default value is OR .

This also kind of breaks the behaviour of search bar in kibana where a user has to type 0033568198782 OR 0033698347370 rather than 0033568198782 0033698347370 to match keyword fields. I get the difference between the tokenized and keyword/raw fields but this doesn't seem to be intended behaviour in this case. Is it a bug or there is some explanation for it?

Hi Varun,
The problem here is the fact you're searching untokenized keyword fields.
Whitespace could technically be a part of the content and this effects the way the query string is parsed.
I used the validate query API to inspect your query:

GET test/_doc/_validate/query?explain=true
{
	"query": {
	"query_string": {
	  "query": "0033568198782 0033698347370",
	  "analyze_wildcard": true,
	  "default_operator": "OR"
	}
  }
}

This gave back

"valid" : true,
"explanations" : [
{
  "index" : "test",
  "valid" : true,
  "explanation" : "+(imei:0033568198782 0033698347370 | imsi:0033568198782 0033698347370) #*:*"
}
]	

Note that the two terms are seen as one seperated by whitespace (keyword fields can contain whitespace).
It is only the introduction of the OR in the query string that forces the parsing to ignore the whitespace.
You could try searching a form of text field instead - maybe using copy_to e.g.

PUT test
{
  "settings": {
	"number_of_replicas": 0,
	"number_of_shards": 1
  },
  "mappings": {
	"_doc":{
	  "properties":{
		"all_content":{
		  "type":"text"
		},
		"imsi":{
		  "type":"keyword",
		  "copy_to":"all_content"
		},
		"imei":{
		  "type":"keyword",
		  "copy_to":"all_content"
		}
	
	  }
	}
  }
}
POST test/_doc/1
{
  "imsi":"0033568198782"
}
POST test/_doc/2
{
  "imei":"0033698347370"
}
GET test/_doc/_search
{
  "query": {
	"query_string": {
	  "query": "0033568198782 0033698347370",
	  "analyze_wildcard": true,
	  "default_operator": "OR"
	}
  }
}
2 Likes

Yes, it totally makes sense! I couldn't come up with the precise reason for this behaviour to satisfy others. Thank you!

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.