Dec 23rd, 2020: [EN] New additions to the keyword family: constant_keyword and wildcard

We’ve recently introduced two additional keyword types, wildcard and constant_keyword. In this post, we’ll briefly introduce both.

The wildcard field is optimized for matching any part of a string value using wildcards or regular expressions. Typical use cases are security, where we might search for a pattern in a process name, and grep-like queries on log lines that have not been modeled into separate fields.

The wildcard field type was introduced in version 7.9, and we’ll demonstrate it with a basic example. We’ll be using the Kibana sample data set “Sample web logs” on Kibana 7.10.0.

We would like to get all the different zip files that users downloaded from our web site. With the current sample data, we could run the following query:

GET kibana_sample_data_logs/_search?filter_path=aggregations.zip-downloads.buckets.key
{
  "size": 0, 
  "_source": "request", 
  "query": {
    "wildcard": {
      "url.keyword": {
        "value": "*downloads*.ZIP",
        "case_insensitive": true
      }
    }
  },
  "aggs": {
    "zip-downloads": {
      "terms": {
        "field": "url.keyword",
        "size": 10
      }
    }
  }
}

This query runs against the url.keyword field, which already exists in the kibana_sample_data_logs index, and returns:

{
  "aggregations" : {
    "zip-downloads" : {
      "buckets" : [
        {
          "key" : "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.zip"
        },
        {
          "key" : "https://artifacts.elastic.co/downloads/apm-server/apm-server-6.3.2-windows-x86.zip"
        },
        {
          "key" : "https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-windows-x86_64.zip"
        }
      ]
    }
  }
}
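To make the pattern semantics concrete, here is a minimal Python sketch of how a pattern such as *downloads*.ZIP can be read: * matches any run of characters, ? matches exactly one character, and with case_insensitive the comparison ignores case. The helper wildcard_to_regex is hypothetical and only models the matching rule (it ignores escaping with a backslash); it says nothing about how Elasticsearch executes the query. The .deb URL is an illustrative non-match added for contrast.

```python
import re

def wildcard_to_regex(pattern: str, case_insensitive: bool = False) -> re.Pattern:
    """Translate a pattern using * and ? into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")            # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")             # exactly one character
        else:
            parts.append(re.escape(ch))   # everything else is matched literally
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)

urls = [
    "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.zip",
    "https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-windows-x86_64.zip",
    "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.deb",
]

matcher = wildcard_to_regex("*downloads*.ZIP", case_insensitive=True)
# fullmatch: the pattern has to cover the whole value, not just a part of it.
matches = [u for u in urls if matcher.fullmatch(u)]
print(matches)
```

Without case_insensitive, the same pattern would match none of these URLs, because the stored values end in lowercase .zip.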

Once we have the index kibana_sample_data_logs, we can update its mappings to add, alongside the existing fields url (type text) and url.keyword (type keyword), a third field url.wildcard of type wildcard.

PUT kibana_sample_data_logs/_mappings
{
  "properties" : {
    "url" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "ignore_above" : 256
        },
        "wildcard" : {
          "type" : "wildcard"
        }
      }
    }
  }
}

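The existing documents were indexed before the new sub-field existed. To populate it, we can run an update by query without a script, which reindexes each document in place so that it picks up the new mapping: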

POST kibana_sample_data_logs/_update_by_query

And now we can execute the same query, using the url.wildcard field.

GET kibana_sample_data_logs/_search?filter_path=aggregations.zip-downloads.buckets.key
{
  "size": 0, 
  "_source": "request", 
  "query": {
    "wildcard": {
      "url.wildcard": {
        "value": "*downloads*.ZIP",
        "case_insensitive": true
      }
    }
  },
  "aggs": {
    "zip-downloads": {
      "terms": {
        "field": "url.keyword",
        "size": 10
      }
    }
  }
}

What we’ve done is replace a costly leading wildcard query on a keyword field with a leading wildcard query on a wildcard field.

How does this help us?

  • A keyword field is not optimized for leading wildcard searches: on a high-cardinality field, the query has to scan every distinct value. A wildcard field indexes the whole field value using ngrams and also stores the full string. The ngram index acts as a rough filter that narrows down the candidate values, which are then verified against the full stored string. Combining both data structures speeds up the search.
  • String lengths have a limit on keyword fields. Lucene has a hard limit of 32k on terms, and Elasticsearch mappings often impose an even lower one (such as the ignore_above of 256 in the mapping above). Values beyond the limit are dropped from the index. A stack trace, for example, can be lengthy, and in some cases, like security, this creates blind spots that are not acceptable. The wildcard field avoids this limitation.
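The two-step idea from the first point (rough ngram filter, then verification against the full string) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual Lucene or Elasticsearch implementation: the class ToyWildcardField, the choice of 3-character ngrams, the always case-insensitive matching, and the last two URLs are all made up for illustration.

```python
import re
from collections import defaultdict

def trigrams(text: str) -> set:
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class ToyWildcardField:
    def __init__(self):
        self.values = []                    # full strings, kept for verification
        self.postings = defaultdict(set)    # trigram -> ids of values containing it

    def add(self, value: str) -> None:
        doc_id = len(self.values)
        self.values.append(value)
        for gram in trigrams(value.lower()):
            self.postings[gram].add(doc_id)

    def candidates(self, pattern: str) -> set:
        """Rough filter: ids whose value contains every trigram of every literal piece."""
        ids = set(range(len(self.values)))
        for literal in re.split(r"[*?]", pattern.lower()):
            for gram in trigrams(literal):
                ids &= self.postings.get(gram, set())
        return ids

    def search(self, pattern: str) -> list:
        regex = re.compile(
            "".join(".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern),
            re.IGNORECASE | re.DOTALL,
        )
        # Verification: only the candidates are checked against the full string.
        return [
            self.values[i]
            for i in sorted(self.candidates(pattern))
            if regex.fullmatch(self.values[i])
        ]

field = ToyWildcardField()
for url in [
    "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.2.zip",
    "https://artifacts.elastic.co/downloads/kibana/kibana-6.3.2-windows-x86_64.zip",
    "https://www.elastic.co/downloads/beats/metricbeat",
    "https://cdn.example.com/styles/main.css",
]:
    field.add(url)

print(len(field.candidates("*downloads*.ZIP")), "candidates to verify")
print(field.search("*downloads*.ZIP"))
```

The filter may let through values that do not match (the literal pieces could appear in the wrong order, for instance), which is why the verification step against the full string is still needed.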

As you can see, the same search we ran on the keyword field also works on the new wildcard field. In this basic case the speed will be similar, as the url field does not have high cardinality in this small data sample, but the example shows how it works.

When to use a wildcard? What’s the trade-off? In this post we won't get into more details. To further dig into this, go ahead and check the following resources:

Moving on to the second addition to the keyword family: the constant_keyword field, available since version 7.7.0.

This is a field we can use in cases where we want to speed up filter searches.

Usually, the more documents a filter matches, the more that query costs. For example, if we send all application logs collected for different dev teams to the same index, we may later find that we usually filter them by team. If we know this is a common filter for our use case, we could ingest the data into different indices, based on the team field value of each log line. Queries would then be faster, as they would only hit the indices holding data for that team.
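As a rough illustration of routing at ingest time, the Python sketch below builds a bulk request body that sends each log line to an index derived from its team value. The helper build_bulk_body, the shortened messages and the naming scheme (which mirrors the my-logs-team-* indices used in this post) are assumptions for illustration. The action line followed by the document line is the newline-delimited JSON layout expected by the _bulk API.

```python
import json

def build_bulk_body(logs: list) -> str:
    """Return an NDJSON body for POST _bulk, with one index action per log line."""
    lines = []
    for log in logs:
        # Assumed naming scheme: one index per team, e.g. my-logs-team-a.
        index_name = f"my-logs-team-{log['team'].lower()}"
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(log))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

logs = [
    {"@timestamp": "2020-12-04T10:00:13.637Z", "message": "GET /apm 200", "team": "A"},
    {"@timestamp": "2020-12-04T10:01:12.654Z", "message": "GET /apm 200", "team": "B"},
]

body = build_bulk_body(logs)
print(body)
```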

And we can go one step further. Maybe we do not want to change our clients' logic, that is, the way they query. We'd like the same queries to keep working, while skipping more effectively the indices whose data won't match.

If we create those indices with a constant_keyword field, and set the constant in the mappings, the queries sent to all indices will make use of that field to discard indices that won't match.

We’ll go over it with a basic example to demonstrate how it works.

We’ll create two indices to hold logs for two different teams, A and B. On each index, we define the field team with a constant value, A or B.

PUT my-logs-team-a
{
  "mappings" : {
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "message" : {
        "type" : "text"
      },
      "team" : {
        "type" : "constant_keyword",
        "value" : "A"
      }
    }
  }
}

PUT my-logs-team-b
{
  "mappings" : {
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "message" : {
        "type" : "text"
      },
      "team" : {
        "type" : "constant_keyword",
        "value" : "B"
      }
    }
  }
}

We’ll now ingest a document into each index. We can include the team field in the document, as long as its value matches the one set in the mapping:

POST my-logs-team-a/_doc/
{
  "@timestamp": "2020-12-04T10:00:13.637Z",
  "message": "239.23.215.100 - - [2018-08-10T13:09:44.504Z] \"GET /apm HTTP/1.1\" 200 8679 \"-\" \"Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24\"",
  "team": "A"
}

Or we can leave the field out, letting it take the value defined in the mapping:

POST my-logs-team-b/_doc/
{
  "@timestamp": "2020-12-04T10:01:12.654Z",
  "message": "34.52.49.238 - - [2018-08-10T12:23:20.235Z] \"GET /apm HTTP/1.1\" 200 117 \"-\" \"Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24\""
}

We can now run queries more efficiently without having to pick the indices that contain the values we want to filter on. We can search on my-logs-team-*, and Elasticsearch will do its magic for us:

GET my-logs-team-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "team": "B"
          }
        }
      ]
    }
  }
}

If we run the query in Kibana's "Search Profiler", we can see that when searching for team B, Elasticsearch executes a match_none query on the my-logs-team-a index, which speeds up the filtering operation.

If we had created the indices with a team field of type keyword and ran the same test, we would see both indices run the same term query, even though one of them returns no results.
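The difference can be pictured with a small Python model. It is only a sketch of the behavior observed in the profiler, not Elasticsearch's actual code: the function rewrite_term_query and the dictionary-based mappings are hypothetical, and the match_all branch for the matching index is an assumption of this model, since the test above only shows the match_none case.

```python
def rewrite_term_query(mapping: dict, field: str, value: str) -> dict:
    """Rewrite a term query against one index using only that index's mapping."""
    field_mapping = mapping[field]
    if field_mapping["type"] == "constant_keyword":
        # Every document in the index shares a single value, so the outcome
        # is known without looking at any document.
        if field_mapping["value"] == value:
            return {"match_all": {}}   # assumption of this model
        return {"match_none": {}}
    # A regular keyword field has to look the term up in the index.
    return {"term": {field: value}}

constant_indices = {
    "my-logs-team-a": {"team": {"type": "constant_keyword", "value": "A"}},
    "my-logs-team-b": {"team": {"type": "constant_keyword", "value": "B"}},
}
keyword_indices = {name: {"team": {"type": "keyword"}} for name in constant_indices}

for name, mapping in constant_indices.items():
    print("constant_keyword", name, rewrite_term_query(mapping, "team", "B"))
for name, mapping in keyword_indices.items():
    print("keyword", name, rewrite_term_query(mapping, "team", "B"))
```

In this model, the constant_keyword indices settle the question from the mapping alone, while both keyword indices end up with the same term query to execute.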

Go take the two new keyword types for a spin and let us know how it goes!
