we're using serilog to send logs to elastic. its default mapping ignores keyword values longer than 256 characters (which might also be an elastic default?). this has been fine for us, but recently one of our devs wanted to filter on a keyword field whose value was longer than 256 characters and was surprised when they got 0 results.
i found a workaround for them, so it's not a big deal, but this got me to thinking, why does ignore_above exist? what is the philosophy behind this setting, and when would i want to use it or not use it? is it purely performance related, memory management related, or is it necessary because of something i haven't yet thought of?
ignore_above is used with keyword fields.
Why would you typically put a large text inside a keyword field? keyword is meant for aggregations and exact-match searches.
So basically, any string sent to Elasticsearch is indexed by default (if no explicit mapping is provided) into 2 fields:
a text field with the full content
a keyword sub-field, which is only populated when the value is at most 256 characters (longer values are skipped, not truncated)
Of course, if you need aggregations on a large keyword field, just increase ignore_above in the mapping (or simply remove it), but this has an impact on memory usage.
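To make that concrete, here is a rough sketch in Python of the two mappings involved. The first dict mirrors what dynamic mapping produces for a string; the second raises the limit on an explicitly mapped keyword field. The field name my_field, the value 1024, and the helper is_indexed_as_keyword are made up for illustration; the helper only mimics the rule and is not an Elasticsearch API.

```python
import json

# What dynamic mapping creates for a string field when no explicit
# mapping exists: a text field plus a keyword sub-field capped at 256.
default_string_mapping = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword", "ignore_above": 256}
    },
}

def keyword_mapping(field_name: str, ignore_above: int) -> dict:
    """Build a mapping body with one keyword field (hypothetical helper)."""
    return {
        "mappings": {
            "properties": {
                field_name: {"type": "keyword", "ignore_above": ignore_above}
            }
        }
    }

def is_indexed_as_keyword(value: str, ignore_above: int) -> bool:
    """Mimic the rule: values longer than ignore_above are skipped entirely."""
    return len(value) <= ignore_above

# "my_field" and 1024 are placeholder choices, not recommendations.
body = keyword_mapping("my_field", ignore_above=1024)
print(json.dumps(body, indent=2))
```

With the default of 256, a 300-character value is not indexed in the keyword field at all, which is why an exact-match filter on it returns 0 hits even though the document itself is stored.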
i understand the difference between text and keyword mappings.
maybe I need to explain how serilog works
you define a messageTemplate in your code with a number of merge fields
the merge fields become individual fields in the document while also filling in a log message
example: messageTemplate: {user} performed {action} and the result is {status}.
results in a document with the following fields.
message: jsmith performed deleteUser and the result is success.
user: jsmith
action: deleteUser
status: success
messageTemplate: {user} performed {action} and the result is {status}.
messageTemplate is set to keyword because we want to aggregate on that.
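in case it helps, here is roughly what that looks like, sketched in python rather than C#. this is not serilog's actual implementation, just an imitation of the behaviour described above; the render function is made up for illustration.

```python
import re

def render(template: str, **props) -> dict:
    """Imitate the described behaviour: fill the template and keep each
    merge field, plus the raw template, as separate document fields."""
    message = re.sub(r"\{(\w+)\}", lambda m: str(props[m.group(1)]), template)
    return {"message": message, **props, "messageTemplate": template}

doc = render(
    "{user} performed {action} and the result is {status}.",
    user="jsmith",
    action="deleteUser",
    status="success",
)

for field, value in doc.items():
    print(f"{field}: {value}")
```

the point is that messageTemplate stays identical across every event logged from the same line of code, which is why it is attractive to aggregate on.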
so what i'm really looking to understand is the performance impact of changing ignore_above i guess?
There is a hard upper limit of 32k - Lucene can't index single terms greater than that length.
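One detail worth spelling out: the Lucene limit is 32766 bytes, while ignore_above counts characters. A UTF-8 character can take up to 4 bytes, so the conservative ceiling for ignore_above is 32766 / 4 = 8191. A small Python sketch of that arithmetic (the function name is made up for illustration):

```python
LUCENE_MAX_TERM_BYTES = 32766  # maximum size of a single indexed term, in bytes
MAX_UTF8_BYTES_PER_CHAR = 4    # worst case for one UTF-8 encoded character

def fits_in_lucene_term(value: str) -> bool:
    """Check whether a string's UTF-8 encoding fits in one Lucene term."""
    return len(value.encode("utf-8")) <= LUCENE_MAX_TERM_BYTES

# Largest ignore_above that is safe even if every character needs 4 bytes.
safe_ignore_above = LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR

print(safe_ignore_above)                 # 8191
print(fits_in_lucene_term("a" * 32766))  # True: 1 byte per character
print(fits_in_lucene_term("é" * 20000))  # False: 2 bytes per character
```

If your data is mostly ASCII you can go higher than 8191, but then a value full of multi-byte characters can still exceed the byte limit.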
The longer the strings, the more likely the values are unique, and the more terms have to be scanned if you search using anything other than an exact match.
The longer the strings, the harder they are to show as labels in a bar chart of top N values, etc.
In your case, using Logstash, an ingest pipeline or some client-side code to parse the message into separate user, action and result fields will give you more opportunity to slice and dice, e.g. to ask:
What are the most common actions of user X?
Which action has the most users?
Which actions or users are significantly correlated with "failure" actions?
etc
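As a rough illustration of the first two questions, here is a client-side sketch in Python over a handful of made-up events. In practice you would more likely run terms aggregations in Elasticsearch on the separate fields; the sample data and function names below are invented for the example.

```python
from collections import Counter, defaultdict

# Made-up sample events, already split into separate fields.
events = [
    {"user": "jsmith", "action": "deleteUser", "status": "success"},
    {"user": "jsmith", "action": "deleteUser", "status": "failure"},
    {"user": "jsmith", "action": "createUser", "status": "success"},
    {"user": "adoe", "action": "deleteUser", "status": "success"},
]

def top_actions_for_user(events, user):
    """Most common actions of one user, as (action, count) pairs."""
    return Counter(e["action"] for e in events if e["user"] == user).most_common()

def users_per_action(events):
    """Number of distinct users seen for each action."""
    users = defaultdict(set)
    for e in events:
        users[e["action"]].add(e["user"])
    return {action: len(seen) for action, seen in users.items()}

print(top_actions_for_user(events, "jsmith"))
print(users_per_action(events))
```

None of this is possible when the only thing you can group by is one long template string.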
you are speaking to my soul.
for now they are just interested in filtering on types of messageTemplates.
i'm trying to push for using more generic and strictly defined terms that would eliminate the need for them to rely on treating full-text style logs as keywords, but it's like herding cats.