Dear Elastic community,
We have a logging use case where the message field is quite big, specifically Windows logs collected using winlogbeat. We currently index the message field both as text and as keyword. We were curious whether we could save disk space by disabling the keyword indexing. I tested this on Elasticsearch 6.2.4. The savings are significant:
Without message.keyword: 1.1 GiB (1145415815 bytes)
With message.keyword: 2.5 GiB (2699312441 bytes)
I did the test by reindexing from an old daily index into two new indices, one of them with message.keyword disabled.
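A minimal sketch of what the two target mappings in such a test could look like, together with the savings arithmetic for the sizes above. The mapping type name `doc` and the exact shape of the `keyword` sub-field are assumptions for illustration, not taken from the actual template.

```python
# Hypothetical mappings for the two target indices (Elasticsearch 6.x style,
# which still uses a mapping type; the name "doc" is an assumption).
MAPPING_TEXT_ONLY = {
    "mappings": {
        "doc": {
            "properties": {
                "message": {"type": "text"}
            }
        }
    }
}

MAPPING_TEXT_AND_KEYWORD = {
    "mappings": {
        "doc": {
            "properties": {
                "message": {
                    "type": "text",
                    # Multi-field: the same value is also indexed as keyword.
                    "fields": {"keyword": {"type": "keyword"}},
                }
            }
        }
    }
}


def relative_savings(size_without: int, size_with: int) -> float:
    """Fraction of disk space saved by dropping the keyword sub-field."""
    return (size_with - size_without) / size_with


# Index sizes in bytes, as measured in the test above.
savings = relative_savings(1145415815, 2699312441)
print(f"Saved {savings:.1%} of the index size")
```

With the numbers from the test, this works out to roughly 57.6% less disk usage for that daily index.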
ECS also defines the message field as text, so I guess the new recommendation is to index the message field only as text.
The only issue we would have when disabling the keyword indexing is a watch (which is public, ref: https://github.com/elastic/examples/pull/240) that does a terms aggregation on the message field.
Options:
- Leave it as is. Perhaps more recent Elasticsearch versions have improved, or future versions will, so that indexing both datatypes does not cost that much. Is anything known in this regard?
- Disable keyword for the Windows logs and update the watch, potentially adding a second watch that does the message field aggregation in Painless, for example. Do you have any hints for this approach? Should it be done in Painless, or is there some trick to have Elasticsearch do it?
- Disable keyword for the Windows logs and update the watch by removing the message aggregation and just assuming that every message is unique.
- Have the collection pipeline calculate a hash over the message, index that as keyword and use it for the aggregation.
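The hash option from the list above could be sketched as follows. This is only an illustration in Python, assuming SHA-1 as the hash and a hypothetical field name `message_hash`; in a real pipeline the same step would live in whatever does the collection (Logstash's fingerprint filter is one existing tool for this kind of step). The `Counter` at the end only mimics what a terms aggregation on the hash field would return.

```python
import hashlib
from collections import Counter


def add_message_hash(event: dict) -> dict:
    """Return a copy of the event with a SHA-1 hex digest of its message.

    The field name "message_hash" is hypothetical; it would be mapped
    as keyword so that a terms aggregation can run on it.
    """
    message = event.get("message", "")
    digest = hashlib.sha1(message.encode("utf-8")).hexdigest()
    enriched = dict(event)
    enriched["message_hash"] = digest
    return enriched


# Made-up sample events; two of them carry the same message.
events = [
    {"message": "An account was successfully logged on."},
    {"message": "An account was successfully logged on."},
    {"message": "The system time was changed."},
]

enriched = [add_message_hash(e) for e in events]

# Stand-in for a terms aggregation on message_hash:
# identical messages collapse into one bucket.
counts = Counter(e["message_hash"] for e in enriched)
for digest, count in counts.most_common():
    print(digest, count)
```

A fixed 40-character digest per event is much smaller than a full copy of a long Windows message, at the cost of the aggregation buckets no longer being human-readable on their own; the watch would need to fetch one sample document per bucket to show the text.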
Any feedback is welcome.