Elastic Scoring for perfect match


I have a JSON query that gives different results when executed directly in Elastic than when executed in Kibana. My use case is security log searching, where perfect matches are the desired answer, so I believe that Kibana is 'right' and Elastic is 'not 100% right'.

From reading around, I believe this is down to scoring, which looks amazingly powerful but seems too complicated, so I wonder if I'm missing something.

The query I use is built with a layered approach, so many subqueries are bundled into one master query. Here is one example:

    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "answers": "lgincdn.trafficmanager.net,lgincdnvzeuno.azureedge.net,lgincdnvzeuno.ec.azureedge.net,cs1227.wpc.alphacdn.net,"
            }
          },
          {
            "match": {
              "source_ip": ""
            }
          }
        ],
        "filter": []
      }
    },
    "size": 10000,
    "_source": ["*"]

In Elastic I get 18 hits, some of which are not perfect matches.

In Kibana (query converted to RISON) I get 1 hit for the same time frame.

Does it sound like I understand the problem correctly? And can someone point me in the right direction to force perfect matches in Elastic in the simplest way?

Many thanks

The 1 hit in Kibana:

    "ts": "2020-04-18T17:45:10.783496Z",
    "uid": "C1hMM64CCTc971GJGd",
    "id.orig_h": "",
    "id.orig_p": 58952,
    "id.resp_h": "",
    "id.resp_p": 53,
    "proto": "udp",
    "trans_id": 37980,
    "rtt": 0.030848026275634766,
    "query": "logincdn.msauth.net",
    "qclass": 1,
    "qclass_name": "C_INTERNET",
    "qtype": 1,
    "qtype_name": "A",
    "rcode": 0,
    "rcode_name": "NOERROR",
    "AA": false,
    "TC": false,
    "RD": true,
    "RA": true,
    "Z": 0,
    "answers": ["lgincdn.trafficmanager.net", "lgincdnvzeuno.azureedge.net", "lgincdnvzeuno.ec.azureedge.net", "cs1227.wpc.alphacdn.net", ""],
    "TTLs": [155.0, 29.0, 1158.0, 3599.0, 1123.0],
    "rejected": false

Here is one of the 18 hits in Elastic. It's similar, but not a perfect match:

answers: ["cs199.wpc.alphacdn.net", ""]
destination_ips: ""
source_ip: ""
protocol: "udp"
event_type: "bro_dns"
destination_ip: ""
parent_domain_length: 5
syslog-facility: "user"
host: "gateway"
query_class: 1
aa: false
transaction_id: 22868
syslog-priority: "notice"
query: "files3.lynda.com"
rcode: 0
query_type: 1
subdomain_frequency_score: 7.5615
ips: ["", ""]
syslog-host: "seconion-NU691"
ra: true
tags: ["syslogng", "bro", "dns", "top-1m", "internal_destination", "internal_source"]
ttls: [40, 3370]
rd: true
port: 50718
subdomain: "files3"
syslog-tags: ".source.s_bro_dns"
frequency_scores: ["8.2685", "7.5615"]
syslog-host_from: "seconion-nu691"
parent_domain: "lynda"
syslog-sourceip: ""
query_class_name: "C_INTERNET"
highest_registered_domain: "lynda.com"
top_level_domain: "com"
destination_port: 53
rejected: false
source_ips: ""
uid: "CkeLOB18R37pFLbtr3"
highest_registered_domain_frequency_score: 8.2685
source_port: 60965
syslog-file_name: "/nsm/bro/logs/current/dns.log"
@version: "1"
timestamp: "2020-04-03T09:05:38.475Z"
logstash_time: 0.02882218360900879
message: "{"ts":"2020-04-03T09:05:37.392974Z","uid":"CkeLOB18R37pFLbtr3","id.orig_h":"","id.orig_p":60965,"id.resp_h":"","id.resp_p":53,"proto":"udp","trans_id":22868,"rtt":0.0424351692199707,"query":"files3.lynda.com","qclass":1,"qclass_name":"C_INTERNET","qtype":1,"qtype_name":"A","rcode":0,"rcode_name":"NOERROR","AA":false,"TC":false,"RD":true,"RA":true,"Z":0,"answers":["cs199.wpc.alphacdn.net",""],"TTLs":[40.0,3370.0],"rejected":false}"
tld: {subdomain: "files3.lynda.com"}
subdomain_length: 6
tc: "false"
rcode_name: "NOERROR"
query_length: 16
rtt: 0.0424351692199707
@timestamp: "2020-04-03T09:05:37.392Z"
query_type_name: "A"
z: 0

It depends on the mapping you are using.
If the field uses the keyword data type, it will only match exact terms.
Otherwise, the text is analyzed before being indexed, which produces this behavior.

If you are using the default mapping, you can append .keyword to the field name. That gives you exact matching.
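
For illustration, here is a minimal Python sketch of what an exact-match version of the query in the question could look like. It assumes the index uses the default dynamic mapping (so `answers.keyword` exists) and that every element of the `answers` array should be matched verbatim. The helper name `exact_answers_query` is made up for this example. Because each array element is indexed as its own value, the sketch uses one `term` clause per answer instead of one comma-joined string, and places the clauses in `filter` so that scoring plays no part.

```python
import json

def exact_answers_query(answers, size=10000):
    """Build a search body that requires every given answer to be present, verbatim."""
    # One term clause per value: term queries are not analyzed, so each value
    # must equal an indexed keyword value exactly (case and punctuation included).
    clauses = [{"term": {"answers.keyword": value}} for value in answers if value]
    # filter context: every clause must match, and none of them affects the score.
    return {"query": {"bool": {"filter": clauses}}, "size": size}

body = exact_answers_query([
    "lgincdn.trafficmanager.net",
    "lgincdnvzeuno.azureedge.net",
    "lgincdnvzeuno.ec.azureedge.net",
    "cs1227.wpc.alphacdn.net",
])
print(json.dumps(body, indent=2))
```

Note that this only requires all listed values to be present. A document that contains additional answers would still match.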

Thanks David, this works. After more testing I realised this is only half the challenge.

If I understand correctly, ".keyword" forces an exact match on the entire field.

Is it possible to force an exact match using a substring?

e.g. query match on "Bob,Charlie"
Should match "Alice,Bob,Charlie"
Should not match "Charlie.Bob"


The .keyword sub field is generated at index time, behind the scenes, by Elasticsearch when the default mapping is used. It has the type keyword, which means the value is indexed exactly as it was provided, with no transformation.

If you search within this field, indeed only exact matches will work.
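
As a rough picture of that default: when Elasticsearch dynamically maps a new string field, it creates a `text` field with a `keyword` sub field, roughly equivalent to the mapping below. The Python wrapper and the name `default_string_mapping` are only there for illustration.

```python
import json

def default_string_mapping(field_name):
    """Mapping roughly equivalent to what dynamic mapping produces for a string field."""
    return {
        "properties": {
            field_name: {
                "type": "text",  # analyzed at index time
                "fields": {
                    # Addressed in queries as "<field_name>.keyword".
                    "keyword": {
                        "type": "keyword",    # indexed as-is, no analysis
                        "ignore_above": 256,  # longer strings are not indexed here
                    }
                },
            }
        }
    }

mapping = default_string_mapping("answers")
print(json.dumps(mapping, indent=2))
```

One consequence of `ignore_above`: values longer than 256 characters are not indexed in the `.keyword` sub field, so exact-match queries will not find them there.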

Is it possible to force an exact match using a substring?

Not sure. Maybe with a match phrase query, but on a text field. It should at least guarantee the positions of the tokens, if that is what you are after.
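
To make the match phrase idea concrete, here is a small sketch. The first function builds a `match_phrase` request body for a hypothetical text field called `names`. The rest is a pure-Python stand-in for phrase matching: it only lowercases and splits on commas, which is an assumption made for this example and not how the real analyzer tokenizes, so it is worth checking the actual tokens with the `_analyze` API.

```python
def match_phrase_body(field, phrase):
    """Request body for a match_phrase query: tokens must appear in order and adjacent."""
    return {"query": {"match_phrase": {field: phrase}}}

def toy_tokens(text):
    # Stand-in tokenizer (assumption): lowercase and split on commas only.
    return [t for t in text.lower().split(",") if t]

def phrase_matches(query, document):
    """True if the query tokens occur as a contiguous run in the document tokens."""
    q, d = toy_tokens(query), toy_tokens(document)
    return bool(q) and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

body = match_phrase_body("names", "Bob,Charlie")
print(body)
print(phrase_matches("Bob,Charlie", "Alice,Bob,Charlie"))  # True
print(phrase_matches("Bob,Charlie", "Charlie.Bob"))        # False
```

Under this toy tokenization, the phrase behaves like the example in the question: it matches "Alice,Bob,Charlie" and rejects "Charlie.Bob". It would also match longer values that contain the same run of tokens.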

But your example seems theoretical, right? Maybe share a full recreation script as described in About the Elasticsearch category. It will help us better understand what you are doing. Please try to keep the example as simple as possible.

Short version
Many thanks David, let me go away and rephrase the question with examples.

Long version
In this exact situation my process incorrectly turned an array of indexed strings into a single string, which it then compared against the raw message. That is why I'm getting mixed results.
So at the moment my testing is flawed, and I need to address how the project does this.
However, in the long term I still need to change the query to be more precise.
I'll come back to this in the near future when other bits are sorted.

Thanks David, the ".keyword" helps for now :)
