query_string doesn't match some characters

Hello! I'm having trouble understanding how query_string works.

Creating the index template:

PUT _index_template/test
{
  "priority": 500,
  "template": {
    "settings": {
      "index.default_pipeline": "set-timestamp"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  },
  "index_patterns": ["test*"]
}

Filling the index with data:

POST test/_bulk
{"index":{}}
{"message":"00 00 00", "type": "positive"}
{"index":{}}
{"message":"00_00_00", "type": "false positive"}
{"index":{}}
{"message":"00-00-00", "type": "false positive"}

Request:

GET test/_search
{
    "query": {
        "bool": {
            "must": [
                {
                  "query_string": {
                    "query": "/ <regexp> /",
                    "fields": ["message"],
                    "allow_leading_wildcard": "true",
                    "analyze_wildcard": "true",
                    "boost": 1
                  }
                },
                {
                    "range": {
                        "@timestamp": {
                            "from": "now-10d",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "aggregations": {
        "indexes": {
            "terms": {
                "field": "_index",
                "size": 10,
                "min_doc_count": 1,
                "shard_min_doc_count": 0,
                "show_term_doc_count_error": false,
                "order": [
                    {
                        "_count": "desc"
                    },
                    {
                        "_key": "asc"
                    }
                ]
            }
        }
    }
}

Can someone explain why documents containing " " and "-" are ignored? Documents whose message field contains these characters are missing from the response, even though the characters are explicitly specified in the regexp.

Here are the regexps I tried:

"query": "/00[-_ ]{1}00[-_ ]{1}00/"
"query": "/00[\\s\\W_]00[\\s\\W_]00/"
"query": "/00.?00.?00/"
"query": "/00.*00.*00/"

The response always contained only the document with the value 00_00_00.

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "test",
        "_id": "1ic-ZIkBBNQA9kiNXSvQ",
        "_score": 2,
        "_source": {
          "@timestamp": "2023-07-17T14:25:55.960338327Z",
          "message": "00_00_00",
          "type": "positive"
        }
      }
    ]
  },
  "aggregations": {
    "indexes": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "logcrusher-test",
          "doc_count": 1
        }
      ]
    }
  }
}

I did not have this problem with the script approach, but unfortunately it is not suitable for my project.

I understand that this is caused by tokenization, but knowing that does not help me :grin:
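
For reference, the tokenization can be inspected with the _analyze API. This is a minimal sketch, assuming the test index from the bulk request above exists and that message uses the default standard analyzer (the template sets no analyzer):

```
GET test/_analyze
{
  "field": "message",
  "text": ["00 00 00", "00-00-00", "00_00_00"]
}
```

With the standard analyzer, the first two values should each be split into three separate 00 tokens, because the space and the hyphen are token boundaries, while 00_00_00 should stay a single token. A regexp in query_string is matched against individual terms, not against the whole field value, so a pattern spanning a space or a hyphen has nothing to match.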

A possible solution would be to disable tokenization for the field somehow, or to index the entire string as a single token, but I'm afraid this might cause problems with keyword length limits.
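
One way to index the entire string as a single token is a keyword sub-field next to the text field. The sketch below is an illustration only: the sub-field name raw and the ignore_above value are assumptions, not part of the original setup, and existing documents would need to be reindexed after the template change. It also swaps query_string for a plain regexp query to keep the example small.

```
PUT _index_template/test
{
  "priority": 500,
  "template": {
    "settings": {
      "index.default_pipeline": "set-timestamp"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword", "ignore_above": 8191 }
          }
        }
      }
    }
  },
  "index_patterns": ["test*"]
}

GET test/_search
{
  "query": {
    "regexp": {
      "message.raw": {
        "value": ".*00[-_ ]00[-_ ]00.*"
      }
    }
  }
}
```

The regexp has to match the whole keyword value, hence the leading and trailing .* in the pattern. Values longer than ignore_above are not indexed in the raw sub-field, so very long messages would be skipped by this query; a leading .* can also be slow on large indices.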

I found a solution:

((/[2,4,5,8]{1}[0-9]{3}/) AND (/[0-9]{4}/) AND (/[0-9]{4}/) AND (/[0-9]{4}/)) OR (/[2,4,5,8]{1}[0-9]{3}([^0-9a-zA-Z]?[0-9]{4}){3}/)

Combining clauses with AND and OR solves the whitespace problem, and each clause can still be a regexp wrapped in (/.../).
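
Applied to the test data from this topic, the same idea could look like the sketch below. It is an adaptation for illustration, not the query actually used in the project: the AND group targets values that were split into separate 00 tokens, and the OR branch targets values that were kept as one token.

```
GET test/_search
{
  "query": {
    "query_string": {
      "query": "((/00/) AND (/00/) AND (/00/)) OR (/00[-_ ]?00[-_ ]?00/)",
      "fields": ["message"]
    }
  }
}
```

Note that the AND group only checks that matching tokens exist somewhere in the field. It does not enforce their order or adjacency, so it can match more documents than a regexp over the whole string would.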
