Index space usage with text and keyword datatypes

Dear Elastic community,

We have a logging use case where the message field is quite big, specifically Windows logs collected using winlogbeat. We currently index the message field both as text and as keyword, and we were curious whether we could save disk space by disabling the keyword indexing. I tested this on Elasticsearch 6.2.4. The savings are significant:

Without message.keyword: 1.1 GiB (1145415815 bytes)
With message.keyword: 2.5 GiB (2699312441 bytes)

I did the test by reindexing from an old daily index into two new indices, one with message.keyword disabled.
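For reference, the comparison boils down to something like the following two mappings (a sketch only — the index names and the 6.x `doc` mapping type are illustrative, not the actual winlogbeat template):

```json
PUT logs-text-only
{
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}

PUT logs-text-and-keyword
{
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```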

ECS also defines the message field as text, so I guess the new recommendation is to index the message field only as text.

There is only one issue we would have when disabling the keyword indexing: a watch (which is public, ref: https://github.com/elastic/examples/pull/240) that does a terms aggregation on the message field.

Options:

  1. Leave it as is; Elasticsearch may already have improved in more recent versions, or may improve in the future, so that indexing both datatypes does not matter that much. Is there something known in this regard?
  2. Disable keyword for the Windows logs and update the watch (or add a second watch) to do the message field aggregation some other way, for example in Painless. Do you have some hints for this approach? Is Painless the way to go, or is there some trick by which Elasticsearch can do it?
  3. Disable keyword for the Windows logs and update the watch by removing the message aggregation and just assuming that every message is unique.
  4. Have the collection pipeline calculate a hash over the message, index that as keyword and use it for the aggregation.
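On the Elasticsearch side, option 4 could look roughly like this (the field name message_hash and the index name are assumptions; the hash itself would be computed upstream in the collection pipeline, e.g. with the Logstash fingerprint filter) — a keyword field for the hash, and the watch's terms aggregation pointed at it:

```json
PUT winlogbeat-hashed
{
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text" },
        "message_hash": { "type": "keyword" }
      }
    }
  }
}

GET winlogbeat-hashed/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "terms": { "field": "message_hash" }
    }
  }
}
```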

Any feedback is welcome.

I'm not too familiar with the details of ECS, but generally, mapping large free-text fields as keyword has limited use in my experience. There's a hard Lucene limit of 32766 bytes on indexed values, and typically an ignore_above setting is used to avoid index bloat of the type you saw. That also means a lot of docs then have no indexed value.
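As a sketch of that common pattern (the 1024-character cut-off is purely illustrative):

```json
PUT logs
{
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        }
      }
    }
  }
}
```

Values longer than ignore_above are simply not indexed in the keyword sub-field, which is why those docs end up with no indexed value there.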
The idea of a hash is a good one which:

a) Keeps a limit on the size stored in the index
b) Limits the size of values shown in Kibana histograms etc
c) Retains a value for every doc

The downside is a lack of readability in visualization results, but you can typically drill down to the raw docs to see the original message. A compromise might be to index the hash together with the first N characters of the message for readability, e.g.:

[xxMyHashxxx] Error reading file ....
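That combined value could be assembled at ingest time, for example with a script processor along these lines (a sketch only — the pipeline name and the message_id/message_hash field names are assumptions, with the hash computed upstream, and 80 is an arbitrary prefix length):

```json
PUT _ingest/pipeline/hash-prefix
{
  "processors": [
    {
      "script": {
        "source": "ctx.message_id = '[' + ctx.message_hash + '] ' + ctx.message.substring(0, Math.min(80, ctx.message.length()))"
      }
    }
  ]
}
```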

Thanks very much for your feedback!

Option 4, the one with the hash, is also my favorite, but I wanted to present the options without bias.

That is an interesting addition to option 4. I think it is not needed for my use case, because the "message_hash" field would not be intended for humans to look at, and in the watch's Painless script I can retrieve the full message. I guess it can be helpful for others, though, so thanks for mentioning it!

This should solve the issue for me. I will need to test and implement when I get around to it.
