Hi all,
I have a problem with how the query string is tokenized when performing a query_string search. Would it be possible to use the whitespace tokenizer instead of the standard one? I'm searching for an exact phrase that contains hyphens '-' (a GUID), and it gets split into parts that are then searched for individually. So instead of a single result (the document with the exact GUID), I end up with multiple results: all the records whose field contains any of those parts.
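The splitting is easy to reproduce with the _analyze API, for example:

POST /_analyze
{
  "analyzer": "standard",
  "text": "d6220c50-9ec1-ea11-9b05-501ac5532e5e"
}

This returns five separate tokens ('d6220c50', '9ec1', 'ea11', '9b05', '501ac5532e5e'), while the same request with "analyzer": "whitespace" keeps the whole GUID as a single token.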
To give the whole picture: I have a watcher that searches for output transactions with large amounts (the 'outputs' part of the chained input), and if any are found (a transform is done in the 'orders_lookup' part of the chain), it searches for the corresponding inputs (the 'inputs' part of the chain). Here is the relevant part of that watcher:
"input": {
"chain": {
"inputs": [
{
"outputs": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"store"
],
"rest_total_hits_as_int": true,
"body": {
"query": {
"bool": {
"filter": [
{
"term": {
"status": 0
}
},
{
"term": {
"transactionType": 4
}
},
{
"range": {
"amount": {
"gte": "{{ctx.metadata.threshold}}"
}
}
},
{
"range": {
"eventTime": {
"gte": "now-{{ctx.metadata.window_period_outputs}}m"
}
}
}
]
}
}
}
}
}
}
},
{
"orders_lookup": {
"transform": {
"script": {
"source": """HashSet orders = new HashSet();
for (output in ctx.payload.outputs.hits.hits) orders.add(output._source.OrderId);
return ['ordersA' : orders];""",
"lang": "painless"
}
}
}
},
{
"inputs": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"store"
],
"rest_total_hits_as_int": true,
"body": {
"query": {
"bool": {
"filter": [
{
"term": {
"status": 0
}
},
{
"term": {
"transactionType": 3
}
},
{
"range": {
"eventTime": {
"gte": "now-{{ctx.metadata.window_period_inputs}}m"
}
}
},
{
"query_string": {
"default_field": "OrderId.txt",
"query": "{{#ctx.payload.orders_lookup.ordersA}}'{{.}}' {{/ctx.payload.orders_lookup.ordersA}}"
}
}
]
}
}
}
}
}
}
}
]
}
}
Given that there could be more than one large transaction, I'm using query_string to search the inputs for all of the collected order ids at once.
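So if, for example, the lookup returned two order ids (made-up values here, just to show the shape), the query_string above would render to something like:

"query_string": {
  "default_field": "OrderId.txt",
  "query": "'11111111-aaaa-bbbb-cccc-222222222222' '33333333-dddd-eeee-ffff-444444444444' "
}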
Field mapping is:
"OrderId": { "type": "keyword", "fields": { "txt": { "type": "text" } } }
And here is an example:
{ "query_string": { "default_field": "OrderId.txt", "query": "d6220c50-9ec1-ea11-9b05-501ac5532e5e" } }
This will return any document whose OrderId field contains any of the tokens 'd6220c50', '9ec1', 'ea11', '9b05' or '501ac5532e5e'.
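If I understand the analysis correctly, the query text is run through the standard analyzer of OrderId.txt, so the query above ends up behaving roughly like:

{
  "bool": {
    "should": [
      { "term": { "OrderId.txt": "d6220c50" } },
      { "term": { "OrderId.txt": "9ec1" } },
      { "term": { "OrderId.txt": "ea11" } },
      { "term": { "OrderId.txt": "9b05" } },
      { "term": { "OrderId.txt": "501ac5532e5e" } }
    ]
  }
}

i.e. a document only needs to contain one of the parts to match.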
If I add the analyzer:
{ "query_string": { "default_field": "OrderId.txt", "query": "d6220c50-9ec1-ea11-9b05-501ac5532e5e", "analyzer": "whitespace" } }
I get 0 hits. Why is that?
Thanks in advance