How do I set a field type using grok?


(Eric) #1

I know that %{NUMBER:field:integer} works. Actually, I think this works too: %{NUMBER:field:int}

I use the grok filter heavily and would like to know if I can set the field type in a custom pattern without needing to use mutate to convert the field to an integer.

For instance, will any of these work by design?

(?<field>regex):integer
or
((?<field>regex):integer)
or
(?<field:integer>regex)


(Ed) #2

The nice way of doing something like this is

%{NUMBER:fieldname:int}


(Aaron Mildenstein) #3

I don't think that the :integer type-casting will work, but you should make some simple tests with output { stdout { codec => rubydebug } } to find out.
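A quick test along those lines might look like the sketch below. It assumes a grok filter version that supports the pattern_definitions option. MY_COUNT, count, and size are hypothetical names chosen for illustration. The first field is cast with the :int suffix on a custom pattern, while the inline named capture is converted afterwards with mutate for comparison.

```
input { stdin { } }

filter {
  grok {
    # MY_COUNT is a hypothetical custom pattern defined inline
    # (assumes the installed grok filter supports pattern_definitions)
    pattern_definitions => { "MY_COUNT" => "[0-9]+" }

    # "count" is cast by grok via the :int suffix;
    # "size" is a plain named capture and comes out of grok as a string
    match => { "message" => "%{MY_COUNT:count:int} (?<size>[0-9]+)" }
  }

  # Fallback for the inline named capture: convert it after grok
  mutate { convert => { "size" => "integer" } }
}

output { stdout { codec => rubydebug } }
```

Typing a line such as 42 1024 on stdin and reading the rubydebug output should show whether each field came through as a number (printed without quotes) or as a string (printed in quotes). Swapping :int for :integer in the same pipeline is an easy way to check whether that suffix is honored.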

Logstash does some typing internally (e.g. :int, :float, and :string, the default type), and these are great for typing and comparison operations. Elasticsearch does it much more broadly, and to greater effect, when you manually map the core numeric types:

The type of the number. Can be float, double, integer, long, short, byte.

As with Java typing, a float will take up less "space" than a double, and the same is true of the integer types: long > integer > short > byte.

When you declare an integer with :int or mutate convert in Logstash, it will appear to be an integer in the JSON sent to Elasticsearch. Likewise, :float will appear to be a floating point value in the JSON. The trouble is that Elasticsearch doesn't know the scale you intend, so it guesses double for any floating point value, and long for any integer value (see the first line of the Core Types documentation). This guesswork on Elasticsearch's part doesn't usually hamper anyone, but for performance and storage reasons, I'd recommend using the smallest type that will fit your data—if you can manually map and type your fields, that is.


(Eric) #4

Very, very explanatory! Eventually, the storage and performance requirements will likely be on the scale of hundreds of terabytes. The current working datasets aren't that large, but the strain can already be felt on a moderately beefy VM (Java core dumps, out of memory).

One thing I haven't tested yet is the ability to re-index. With that, even if data is initially poorly indexed, re-indexing will allow those indexes to become more useful without needing to reparse the original data.

