Keyword fields, Graylog and ElastAlert

have graylog in front of elasticsearch, getting the following error when trying to generate quick values:

Unable to perform search query\n\nFielddata is disabled on text fields by default. Set fielddata=true on [param_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

it sounds like the latter is the best route, but wondering what the implications will be. this only involves changing the index mapping, as far as i'm understanding? also, trying to wrap my head around the .keyword vs .raw concept. because we also use elastalert. would like to know how this will affect those rules, and performance, or anything else glaring we should be aware of.

Certain operations in Elasticsearch such as sorting, aggregations and access to field values in scripts, require looking up the terms associated with a document field. This is a different data access pattern than what is needed for full-text search, which needs to search a collection of terms and find the associated documents.

With a text field datatype mapping, index-time analysis, controlled by the analyzer associated with the field mapping (or the default standard analyzer if none is defined), tokenizes the input and produces zero or more terms that are inserted into the inverted index. The inverted index is the data structure for full-text search. For one of the aforementioned operations like sorting and aggregations, the terms in the inverted index would need to be un-inverted in memory to build a columnar data structure that can then be searched efficiently to provide the values needed for the operation. Elasticsearch does not allow this un-inversion process for text datatypes without explicitly setting fielddata to true because, for a large set of documents, this in-memory data structure can be bigger than the heap allocated to Elasticsearch, potentially leading to the node process running out of memory.

Now, with a keyword field datatype mapping, this columnar data structure can be built and persisted to disk at index time, and is known by its Lucene name, doc_values (Check out the deep dive on doc_values if you're interested to know more). When it comes to an operation that needs to use doc_values, these data structures can be loaded on demand, and leverage the filesystem cache for quick access, no longer constrained by the size of the heap allocated to Elasticsearch.
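To make the two access patterns concrete, here is a toy Python sketch of an inverted index and the un-inversion step. It is only a model of the idea, not how Lucene stores anything: the whitespace "analyzer", the sample documents and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it (term -> docs)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analyzer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def uninvert(inverted):
    """Rebuild the doc -> terms view that sorting and aggregations need."""
    columnar = defaultdict(list)
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            columnar[doc_id].append(term)
    return {doc_id: sorted(terms) for doc_id, terms in columnar.items()}


docs = {1: "Quick brown fox", 2: "quick red fox"}
inverted = build_inverted_index(docs)   # answers "which docs contain this term?"
columnar = uninvert(inverted)           # answers "which terms does this doc have?"
```

Fielddata corresponds to running something like uninvert in heap memory at query time, whereas doc_values amounts to writing the doc-to-terms view to disk while indexing.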

So, for a string in a JSON object, which one do we use, text or keyword? Well, it depends on what you need to do with the field in Elasticsearch:

  1. Need to perform full-text search? Use a text datatype
  2. Need to perform term-level matching (like looking for exact values), sorting, aggregations? Use a keyword datatype

Thankfully, with Elasticsearch we don't need to pick one or the other: we can index the field as both a text datatype and a keyword datatype, then use the correct field for the respective operations! This is where fields, a.k.a. multi-fields, come in. In the mapping for a field, we can specify that the field should be indexed in multiple ways:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "param_name": {
          "type": "text",
          "fields": {
            "keyword": { 
              "type":  "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

This maps param_name as both a text datatype and keyword datatype; when we need to target a field for full-text search, we can use "param_name", and when we need to target a field for sorting, aggregations, or the operation in question, we can use "param_name.keyword".
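As a rough sketch of what targeting each field looks like, the Python below builds the two request bodies as plain dictionaries. The helper names, the aggregation name top_values and the sample search text are made up for illustration; match and terms are the standard full-text query and bucket aggregation.

```python
import json


def match_query(field, text):
    """Full-text search body: targets the analyzed text field."""
    return {"query": {"match": {field: text}}}


def terms_aggregation(field, size=10):
    """Aggregation body: targets the keyword sub-field, which has doc_values."""
    return {
        "size": 0,  # return only aggregation buckets, no hits
        "aggs": {
            # "top_values" is an arbitrary name chosen for this example.
            "top_values": {"terms": {"field": field + ".keyword", "size": size}}
        },
    }


search_body = match_query("param_name", "some value")
agg_body = terms_aggregation("param_name")
print(json.dumps(agg_body, indent=2))
```

Either body can be sent to the index's _search endpoint. Running the terms aggregation against "param_name" rather than "param_name.keyword" is what produces the fielddata error when param_name is a text field.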

The mapping example that I've shown is the default inferred mapping for a string field in a JSON object, so if you have not explicitly mapped the type in the given index, you should have a "param_name.keyword" that you can perform the operation against.

With either option, you will need to change the mapping. The type of an existing field cannot be changed in place, and a mapping change does not affect already indexed documents, so you would need to create a new index with the updated mapping and reindex the existing documents into it. The Reindex API can help with this.
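A minimal sketch of that migration, written as the sequence of requests you would send. The destination name my_index_v2 and the helper function are hypothetical, and the mapping uses the typed "_doc" form from the example above, so adjust it to whatever your version and index actually use.

```python
def reindex_plan(source_index, dest_index, mappings):
    """Return the (method, path, body) requests for a create-then-reindex migration."""
    return [
        # 1. Create the destination index with the corrected mapping.
        ("PUT", "/" + dest_index, {"mappings": mappings}),
        # 2. Copy every document from the old index into the new one.
        (
            "POST",
            "/_reindex",
            {"source": {"index": source_index}, "dest": {"index": dest_index}},
        ),
    ]


# Typed mapping in the same style as the example above; "my_index_v2" is a made-up name.
mappings = {"_doc": {"properties": {"param_name": {"type": "keyword"}}}}
plan = reindex_plan("my_index", "my_index_v2", mappings)

for method, path, body in plan:
    print(method, path, body)
```

The order matters: if the destination index does not exist before the reindex runs, it is created with dynamically inferred mappings instead of the one you intended.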

The preferred option would be to map as a keyword datatype, since it can leverage doc_values. You may already have this keyword mapping though; take a look at the mapping that you have in the index with

GET /<index_name>/_mapping

I'm not familiar with ElastAlert, but essentially, .keyword and .raw are just different names for a multi-field on a mapping; in versions of Elasticsearch before 5.x, raw was the name typically used for a not_analyzed string field (which is the pre-5.x equivalent of the keyword datatype).
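If it helps when checking which name your rules should reference, here is a small Python sketch that inspects a single field's mapping and lists the sub-fields holding exact (un-analyzed) values, whatever they happen to be called. The function name and both sample mappings are assumptions for illustration, not taken from your indices.

```python
def exact_value_subfields(field_name, field_mapping):
    """List the full paths of sub-fields that hold un-analyzed values."""
    paths = []
    for sub_name, sub_mapping in field_mapping.get("fields", {}).items():
        is_keyword = sub_mapping.get("type") == "keyword"
        # Pre-5.x equivalent: a string field with index set to not_analyzed.
        is_legacy = (
            sub_mapping.get("type") == "string"
            and sub_mapping.get("index") == "not_analyzed"
        )
        if is_keyword or is_legacy:
            paths.append(field_name + "." + sub_name)
    return sorted(paths)


# Illustrative 5.x+ style mapping with a "keyword" sub-field.
modern = {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
# Illustrative pre-5.x style mapping with a "raw" sub-field.
legacy = {"type": "string", "fields": {"raw": {"type": "string", "index": "not_analyzed"}}}

print(exact_value_subfields("param_name", modern))
print(exact_value_subfields("param_name", legacy))
```

A field mapped directly as keyword, with no sub-fields, returns an empty list here; in that case the plain field name is the one to use.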


Thank you for the detailed response. This is very helpful.

We have a custom mapping, and new indices are being created with that field as a keyword type, instead of text. I added the below:

	  "param_name" : {
		"type" : "keyword"
	  }

When searching the newly created indices with that mapping applied, I am still receiving the "Fielddata is disabled on text fields by default." message, even though I can see the field when I query the index mapping in ES, and it is in fact a keyword now instead of text.

Is there something obvious I'm missing?

Would you be able to show the search you're using, and the mapping(s) for the index/indices you're targeting?

It sounds like there's a param_name field somewhere that is still mapped as text.

that's what i thought, too

have done this GET on all new indices created, and see the parameter being mapped as keyword

GET index-name/_mapping?pretty

however, i perform the search in graylog:

_exists_:param_name AND _index:index-name

when i try to show quick values on the param_name field, i get the 500 complaining about it being a text field

{
  "index-name": {
    "mappings": {
      "message": {
        "dynamic_templates": [
          {
            "internal_fields": {
              "match": "gl2_*",
              "mapping": {
                "type": "keyword"
              }
            }
          },
          {
            "store_generic": {
              "match_mapping_type": "string",
              "mapping": {
                "type": "keyword"
              }
            }
          }
        ],
        "properties": {
          "BASE10NUM": {
            "type": "keyword"
          },
          "GREEDYDATA": {
            "type": "keyword"
          },
          "HOSTNAME": {
            "type": "keyword"
          },
          "IPORHOST": {
            "type": "keyword"
          },
          "MONTHDAY": {
            "type": "keyword"
          },
          "MONTHNUM": {
            "type": "keyword"
          },
          "NUMBER": {
            "type": "keyword"
          },
          "POSINT": {
            "type": "keyword"
          },

          ....
          
          "param_name": {
            "type": "keyword"
          },

          ....

          
        }
      }
    }
  }
}

Are there any indices that existed before the mapping change was made, that are being targeted by the query?

The following query will get the mappings for all indices and use filter_path to include only the paths to param_name field type mappings:

curl -X GET "localhost:9200/_mapping?pretty&filter_path=**.properties.param_name.type"

which will return something like

{
  "index-name-1" : {
    "mappings" : {
      "message" : {
        "properties" : {
          "param_name" : {
            "type" : "keyword"
          }
        }
      }
    }
  },
  "index-name-2" : {
    "mappings" : {
      "message" : {
        "properties" : {
          "param_name" : {
            "type" : "text"
          }
        }
      }
    }
  }
}

This should make it a bit easier to spot which index has param_name mapped as "text".
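With hundreds of indices, scanning that output by eye gets tedious, so here is a short Python sketch that walks a response of the shape shown above and lists the offending indices. The function name is made up, and it assumes the typed mapping layout from this thread (index, then mappings, then a type such as message).

```python
def indices_with_type(mapping_response, field, wanted_type):
    """Return sorted index names where `field` is mapped as `wanted_type`."""
    matches = []
    for index_name, index_body in mapping_response.items():
        # Each index holds one or more mapping types (e.g. "message").
        for type_body in index_body.get("mappings", {}).values():
            field_mapping = type_body.get("properties", {}).get(field, {})
            if field_mapping.get("type") == wanted_type:
                matches.append(index_name)
                break
    return sorted(matches)


# Same shape as the filtered _mapping response shown above.
response = {
    "index-name-1": {
        "mappings": {"message": {"properties": {"param_name": {"type": "keyword"}}}}
    },
    "index-name-2": {
        "mappings": {"message": {"properties": {"param_name": {"type": "text"}}}}
    },
}

text_indices = indices_with_type(response, "param_name", "text")
print(text_indices)
```

In practice you would load the JSON returned by the curl command above and pass it in place of the hard-coded response.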

there are hundreds that were previously mapped as text, but in my query i'm specifically searching the new index (with the keyword field mapping, not text). so i'm confused as to why it's not seeing that.

What about if you use the following query

_exists_:"param_name" AND _index:"index-name"

that's exactly what i'm doing, as shown in my earlier post in this thread.
