Data type for Log Message Fields, does keyword add overhead?

Currently I'm indexing log messages into Elasticsearch. Generally the initial 'message' field is dissected and stored in the following fields: @timestamp(date), logLevel(keyword), logMessage(keyword/text).

                "logMessage": {
                    "fields": {
                        "keyword": {
                            "ignore_above": 10000,
                            "type": "keyword"
                        }
                    },
                    "type": "text"
                },

The logMessage field has millions of variations and is not used for aggregations. We perform a lot of match_phrase, wildcard and query_string queries against this field.

Does saving this as a keyword cause issues? Or initial overhead? Should it be set to match_only_text or text only?

Storing this as keyword when you are already storing it as text will basically double the disk space this field uses for each document.

Since you do not need it for aggregations, you can map it as text or as match_only_text, but since you said you use match_phrase, it is better to keep it as text.

From the match_only_text documentation you have this information:

however queries that need positions such as the match_phrase query perform slower as they need to look at the _source document to verify whether a phrase matches.
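As a rough sketch of the mapping change suggested above, the helper below builds mappings in which logMessage is a single full-text field with no keyword sub-field. The function name `log_message_mapping` is hypothetical, and the other two fields are simply copied from the question. Keep in mind that the type of an existing field cannot be changed in place, so applying this would normally mean creating a new index and reindexing.

```python
def log_message_mapping(field_type="text"):
    """Return index mappings with logMessage as one full-text field.

    `field_type` may be "text" (keeps positions, so match_phrase stays fast)
    or "match_only_text" (smaller on disk, slower phrase queries).
    """
    if field_type not in ("text", "match_only_text"):
        raise ValueError(f"unsupported field type: {field_type}")
    return {
        "properties": {
            "@timestamp": {"type": "date"},
            "logLevel": {"type": "keyword"},
            # No "fields" block here: dropping the keyword sub-field avoids
            # indexing every message a second time.
            "logMessage": {"type": field_type},
        }
    }


# The returned dict is what would go under "mappings" in a create-index request.
mappings = log_message_mapping("text")
```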

Issues with storing log messages in text or keyword fields were the primary reason for creating the wildcard field. See the Elastic blog post “Find strings within strings faster with the Elasticsearch wildcard field”.

Thanks everyone. Now I’m trying to decide between text and wildcard.

90% of my queries on this field are probably match phrase, where we want to match multiple words, for example “server failed to respond to baseball fastball”. The remaining 10% are split between query string and wildcard.

I’m having trouble getting wildcard searches to return results on the text field, and I’m not sure whether that is supposed to be supported. (The same query works against logMessage.keyword.)

Also my use of query string might be wrong, but generally I want to use it to do AND match phrase queries but it always returns log messages I don’t want due to the way it tokenizes the message. “server failed AND baseball fastball” returns a lot of false positives.

A word-based index like the text field relies on matching word sequences using phrase queries or the more sophisticated JSON queries like ‘span’ or ‘intervals’.
Untokenised fields like keyword or wildcard fields are a single string token and you would instead use ‘wildcard’ or ‘regexp’ queries to match character sequences in those strings.
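For the "AND of several phrases" case raised earlier in the thread, one option on a text field is a bool query whose must clauses are all match_phrase queries. The sketch below only builds the request body as a plain dict; the function name `all_phrases_query` is hypothetical. In query_string syntax, unquoted words are treated as separate terms, which is probably where the false positives come from; wrapping each phrase in double quotes ("server failed" AND "baseball fastball") should be the rough equivalent there.

```python
def all_phrases_query(field, phrases):
    """Build a _search body that requires every phrase to occur in `field`."""
    return {
        "query": {
            "bool": {
                # One match_phrase clause per phrase; "must" means all of
                # them have to match for a document to be returned.
                "must": [{"match_phrase": {field: phrase}} for phrase in phrases]
            }
        }
    }


body = all_phrases_query("logMessage", ["server failed", "baseball fastball"])
```

The resulting `body` can be sent as the JSON payload of a `_search` request with whatever client the service already uses.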

Thanks Mark.

PUT message_test_index
{
  "mappings": {
    "properties": {
      "@timestamp" : {
        "type" : "date"
      },
      "messageWildcard" : {
        "type" : "wildcard"
      },
      "messageKeyword" : {
        "type" : "keyword"
      },
      "messageText" : {
        "type" : "text"
      },
      "messageMatchOnlyText" : {
        "type" : "match_only_text"
      }
    }
  }
}

POST message_test_index/_doc
{
  "@timestamp": "2022-01-01T12:10:30Z",
  "messageWildcard": "This is example of search query. Can anyone help me?",
  "messageKeyword": "This is example of search query. Can anyone help me?",
  "messageText": "This is example of search query. Can anyone help me?",
  "messageMatchOnlyText": "This is example of search query. Can anyone help me?"
}

GET message_test_index/_search 
{
  "query": {"match_phrase": {
    "messageWildcard": "is example of"
  }}
}

GET message_test_index/_search 
{
  "query": {"wildcard": {
    "messageWildcard": "*example of*"
  }}
}

GET message_test_index/_search 
{
  "query": {"regexp": {
    "messageWildcard": {"value": ".*is example of.*"}
  }}
}

GET message_test_index/_search 
{
  "query": {"match_phrase": {
    "messageText": "is example of"
  }}
}

GET message_test_index/_search 
{
  "query": {"wildcard": {
    "messageText": "*example of*"
  }}
}

GET message_test_index/_search 
{
  "query": {"regexp": {
    "messageText": {"value": ".*is example of.*"}
  }}
}

I have a service that reads and parses user-generated search terms. Right now the field is mapped as keyword and text, but I think I want wildcard. Most of the patterns are currently defined as match_phrase searches, for example "is example of".

I'm not sure if it is possible to make "is example of" work as a search term against the wildcard data type without adding *'s at the start and end of the query.

Word-sequence searches on text fields assume the words to be found can appear anywhere in the text.
Character-sequence searches on keyword or wildcard fields assume the characters are anchored at the beginning of the string and extend to its end unless you use * characters to un-anchor them.
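If the service has to turn a user's phrase into a "contains" search on a wildcard or keyword field, it could add the * characters itself. The sketch below is one way to do that; the function name `contains_wildcard_query` is hypothetical, and it assumes that a backslash escapes a literal * or ? in a wildcard pattern (Lucene's wildcard syntax), so user input containing those characters is not treated as an operator.

```python
def contains_wildcard_query(field, text):
    """Build a _search body matching `text` anywhere inside `field`."""
    # Escape the backslash first, then the two wildcard operators, so that
    # characters typed by the user are matched literally.
    escaped = text.replace("\\", "\\\\").replace("*", "\\*").replace("?", "\\?")
    # Leading and trailing * un-anchor the pattern from the string's ends.
    return {"query": {"wildcard": {field: {"value": f"*{escaped}*"}}}}


body = contains_wildcard_query("messageWildcard", "is example of")
```

Note that this matches character sequences exactly as typed, including case and whitespace, which is a different behaviour from match_phrase on an analyzed text field.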
