Issues with size and bool queries returning data outside of criteria

I understand that a query given no size will return 10 results by default.
Why, I must ask, if size is 10000 and the query is specific to certain data, does it return results outside of the search criteria?
How, then, would one construct a query like these that only returns results matching the search criteria, regardless of the size parameter?

First search using should...
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        { "match": { "MsgType": "0" } },
        { "match": { "beat.hostname": "ny2-prd-venus20" } },
        {
          "bool": {
            "should": [
              { "match": { "source": "/trades/BTEC-NY2-DC-PRD-2/var/20181107.log" } },
              { "match": { "source": "/trades/BTEC-NY2-DC-PRD-2/var/20181106.log" } }
            ]
          }
        }
      ]
    }
  }
}

This returns 10,000 results regardless of the criteria, including several source records that are not listed in the should clause inside the must???

Second search with no should...
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        { "match": { "MsgType": "0" } },
        { "match": { "beat.hostname": "ny2-prd-venus20" } },
        { "match": { "source": "/trades/BTEC-NY2-DC-PRD-2/var/20181107.log" } }
      ]
    }
  }
}

This again returns 10,000 results, similar to the above.

There are not 10,000 records matching either set of criteria; it is more like 8K records total.

Is there any way to specifically get these 8K records without a hefty must_not listing all other sources?

A bit more info. From the first search, I messed with size and added a sort on timestamp descending. Given these source files are the last 2 files in the file system by date, one would expect these to be the first "relative" records...
When using size == 100,000 (max increased to test quickly), I get 7,619/100,000 records that actually match the criteria, however not the first 7,619 records.
When using size == 10,000, I get 828/10,000 records that actually match the criteria, however also not the first records received.

Here are the file record counts, aka the expectation:
grep BI9 /trades/BTEC-NY2-DC-PRD-2/var/20181107.log |wc -l
9240
grep BI9 /trades/BTEC-NY2-DC-PRD-2/var/20181106.log |wc -l
13819
So I should only be getting about 23K records returned.

What is the mapping for your source field? Can you show some documents that are returned when you use the should clause and that you are not expecting to match?

Using filebeat -> logstash -> kibana, source is automatically assigned:
{
  "log_event_type": "FIX",
  "timestamp": "2018-11-07 14:07:23.852",
  "MsgSeqNum": "00002390",
  "beat": {
    "hostname": "ny2-prd-venus23"
  },
  "micros": "300",
  "SenderCompID": "LQE357A01",
  "TargetCompID": "LE",
  "source": "/trades/LIQE-NY2-DC-GUI-PRD-1/var/20181107.log",
  "host": {
    "name": "ny2-prd-venus23"
  },
  "MsgType": "0",
  "communication": "COMM OUT"
}

The counts shown before each document are the unexpected record number and the total record number.
I removed some fields from _source to play around; here are a few more:
9165 9996 {
  "log_event_type": "FIX",
  "source": "/trades/ICE-CH1-DC-PRD-1/var/20181107.log",
  "beat": {
    "hostname": "ch1-prd-venus06"
  },
  "micros": "238",
  "MsgSeqNum": "0991191",
  "MsgType": "0",
  "timestamp": "2018-11-07 14:14:31.071"
}
9166 9997 {
  "log_event_type": "FIX",
  "source": "/trades/LIQE-NY2-DC-GUI-PRD-1/var/20181107.log",
  "beat": {
    "hostname": "ny2-prd-venus23"
  },
  "micros": "603",
  "MsgSeqNum": "2328",
  "MsgType": "0",
  "timestamp": "2018-11-07 14:14:29.612"
}
9167 9998 {
  "log_event_type": "FIX",
  "source": "/trades/LIQE-NY2-DC-PROP-PRD-1/var/20181107.log",
  "beat": {
    "hostname": "ny2-prd-venus23"
  },
  "micros": "637",
  "MsgSeqNum": "2328",
  "MsgType": "0",
  "timestamp": "2018-11-07 14:14:29.612"
}
9168 9999 {
  "log_event_type": "FIX",
  "source": "/trades/LIQE-NY2-DC-GUI-PRD-1/var/20181107.log",
  "beat": {
    "hostname": "ny2-prd-venus23"
  },
  "micros": "262",
  "MsgSeqNum": "2268",
  "MsgType": "0",
  "timestamp": "2018-11-07 14:14:28.630"
}

If you are using the dynamic mappings, the source field is analysed for free-text search. This means that it is broken up into tokens by the standard analyser, and query strings that match only some of these tokens, e.g. trades, will match.

GET _analyze
{
  "analyzer" : "standard",
  "text" : "/trades/BTEC-NY2-DC-PRD-2/var/20181106.log"
}

shows that the string is analysed into the following tokens:

{
  "tokens" : [
    {
      "token" : "trades",
      "start_offset" : 1,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "btec",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ny2",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "dc",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "prd",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "2",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "<NUM>",
      "position" : 5
    },
    {
      "token" : "var",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "20181106",
      "start_offset" : 30,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 7
    },
    {
      "token" : "log",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}
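
The token overlap above is what lets unrelated source paths match: an analysed match query succeeds whenever any query token is found in the field. Here is a rough Python sketch of that behaviour (an approximation of the standard analyser, not Elasticsearch itself; the helper names are made up for illustration):

```python
import re

def standard_tokens(text):
    """Approximate the standard analyser: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_query(indexed_value, query_text):
    """A match query on an analysed field succeeds if ANY token is shared."""
    return bool(set(standard_tokens(indexed_value)) & set(standard_tokens(query_text)))

query = "/trades/BTEC-NY2-DC-PRD-2/var/20181107.log"
other = "/trades/ICE-CH1-DC-PRD-1/var/20181107.log"

# Both paths match, because they share tokens such as "trades", "var" and "log".
print(match_query(query, query))  # → True
print(match_query(other, query))  # → True
```

This is why documents from entirely different source files show up in the results.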

If you want the full string to match, you might use a terms query on the source.keyword field instead.

It still makes little sense to me, but it is working now with this
{
  "query": {
    "bool": {
      "must": [
        { "term": { "MsgType.keyword": "0" } },
        { "term": { "beat.hostname.keyword": "ny2-prd-venus20" } },
        {
          "terms": {
            "source.keyword": [
              "/trades/BTEC-NY2-DC-PRD-2/var/20181107.log",
              "/trades/BTEC-NY2-DC-PRD-2/var/20181106.log"
            ]
          }
        }
      ]
    }
  },
  "size": 100000
}
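
In contrast to the analysed match behaviour, a terms query on a .keyword sub-field compares the whole untokenised string. A minimal Python approximation (the helper name is hypothetical, not an Elasticsearch API):

```python
def terms_query_keyword(indexed_value, values):
    """A terms query on a keyword field matches only exact, whole-string values."""
    return indexed_value in values

wanted_sources = [
    "/trades/BTEC-NY2-DC-PRD-2/var/20181107.log",
    "/trades/BTEC-NY2-DC-PRD-2/var/20181106.log",
]

print(terms_query_keyword("/trades/BTEC-NY2-DC-PRD-2/var/20181106.log", wanted_sources))  # → True
print(terms_query_keyword("/trades/ICE-CH1-DC-PRD-1/var/20181107.log", wanted_sources))   # → False
```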

23744 23744 {
  "log_event_type": "Exchange Dropcopy Common",
  "exchange": "btec",
  "beat": {
    "hostname": "ny2-prd-venus20",
    "version": "6.4.2",
    "name": "ny2-prd-venus20"
  },
  "timestamp": "2018-11-06 17:23:58.233",
  "@timestamp": "2018-11-06T23:23:58.233Z",
  "tags": [
    "beats_input_codec_plain_applied"
  ],
  "source": "/trades/BTEC-NY2-DC-PRD-2/var/20181106.log",
  "host": {
    "name": "ny2-prd-venus20"
  },
  "MsgType": "0",
  "micros": 713,
  "offset": 6681866,
  "input": {
    "type": "log"
  },
  "message": " BI9 - received",
  "@version": "1",
  "prospector": {
    "type": "log"
  },
  "levelname": "INFO"
}

TY for the help, the docs could use examples like these - currently this is not very intuitive.