What are the possible problems if I don't give a keyword mapping to my text type?

Hi,

I have a mapping like the one below:

PUT _template/my_template
{
  "index_patterns": ["mylogs*"],
  "mappings": {
    "doc": {
      "properties": {
        .
        .
        .
        "long_text_field_1": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 10922
            }
          }
        },
        "long_text_field_2": {
          "type": "text"
        },
        .
        .
        .
      }
    }
  }
}

The fields hold long UTF-8 text. What will be the performance and other differences between these two fields, assuming the contents of the two fields are the same size? Which operations are possible or not possible on each?
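For context, here is a sketch of what each variant supports. The field names come from the mapping above; the index name `mylogs-test` is hypothetical:

```json
# Full-text (analyzed) search works on both fields, since both are "text":
GET mylogs-test/_search
{
  "query": { "match": { "long_text_field_2": "error timeout" } }
}

# Exact matching, sorting, and terms aggregations need the keyword
# sub-field, so they work for long_text_field_1 but not long_text_field_2:
GET mylogs-test/_search
{
  "size": 0,
  "aggs": {
    "top_values": { "terms": { "field": "long_text_field_1.keyword" } }
  }
}
```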

Thanks.

Why would you need a single token to be up to 10922 characters in length?

  • Users are unlikely to perform exact-match searches by typing in a 10,000-character string.
  • Users are unlikely to select a 10,000-character string from a drop-down box in a structured form.
  • Users are unlikely to want to perform aggregations where a bar on a bar chart is labelled with a 10,000-character string.

The only scenario I can conceive of is where something like a long URL is stored in a system and you want to perform some analysis on URL usage. You're likely still going to end up with a cripplingly large number of unique strings in your index. Hashes may be a better approach in these circumstances.
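One way to act on the hashing suggestion, assuming the goal is counting distinct values rather than retrieving them, is the `mapper-murmur3` plugin (it must be installed separately). A sketch of such a mapping, with the `url` field name as an illustrative assumption:

```json
PUT _template/url_template
{
  "index_patterns": ["urls*"],
  "mappings": {
    "doc": {
      "properties": {
        "url": {
          "type": "keyword",
          "fields": {
            "hash": { "type": "murmur3" }
          }
        }
      }
    }
  }
}
```

The `url.hash` sub-field stores a hash of each value, which makes `cardinality` aggregations cheaper than running them over the raw keyword values.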


Thanks @Mark_Harwood

My intention was just to search inside those long text fields: not for the entire text, but for keywords within it. In this case, is it OK to go with the config below?

    "long_text_field_1": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "long_text_field_2": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },

That's perhaps the misinterpretation then. It's "keyword" singular, not plural. A keyword string is treated as a single untokenized keyword like "science fiction", as opposed to a text string, which is tokenized into the words "science" and "fiction".
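You can see the difference with the `_analyze` API:

```json
# Standard analysis (what a "text" field uses by default) splits the
# string into separate tokens: "science" and "fiction".
GET _analyze
{
  "analyzer": "standard",
  "text": "science fiction"
}

# The keyword analyzer keeps the whole string as a single untokenized
# term: "science fiction".
GET _analyze
{
  "analyzer": "keyword",
  "text": "science fiction"
}
```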

Ok. So it denotes the size of "each keyword" inside my entire text. So I can very well go with 256 :grinning:

Nope. It dictates the maximum length of a string that will be stored as an untokenized keyword.
"Keyword" does not mean substring. The process that creates substrings for search is tokenization, which chops fields of type "text" into individual tokens.
Strings of type "keyword" are not tokenized.
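To make that concrete, here is a sketch of how the two query types behave against the mapping above (the index name `mylogs-test` is hypothetical):

```json
# A match query on the text field finds any document containing the
# word "science" anywhere in the field:
GET mylogs-test/_search
{
  "query": { "match": { "long_text_field_1": "science" } }
}

# A term query on the keyword sub-field only matches if the ENTIRE
# stored string is exactly "science"; a word inside a longer string
# will not match:
GET mylogs-test/_search
{
  "query": { "term": { "long_text_field_1.keyword": "science" } }
}
```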

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.