Logstash 2.x: Dynamic Mapping

Hi,
Based on the breaking changes in Elasticsearch 2.0 and other information:

https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_mapping_changes.html

...it's apparent that field names with leading underscores and/or containing dots are a bad thing. I have observed that Logstash will generate a mapping against fields with leading underscores.

Are there plans to make logstash's mapping logic more aware of the elasticsearch schema? And, in the meantime, how can I handle unstructured log data coming in that is quite likely to occasionally break both of the above rules?

And finally, are there any other restrictions I should be mindful of with field naming etc?

Regards,
David

You can use the de_dot filter plugin to help here.
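
A minimal filter block might look like this, assuming the logstash-filter-de_dot plugin is installed (as far as I know it is a separate plugin, not bundled by default):

filter {
  de_dot {
    # replace dots in field names; "_" is the default separator anyway
    separator => "_"
  }
}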

What do you mean by "logstash's mapping logic"? Logstash just emits JSON documents according to the rules that you set up.

I think I must have been misunderstanding something then. I thought that logstash was creating new mappings in elasticsearch for new data streams. It must be elasticsearch doing that itself then...

The de_dot plugin only deals with dots. If fields come in that clash with meta-field names, that can cause all sorts of problems too, as we saw when we received log output that contained an '_uid' field of type string...
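
One stopgap (a rough sketch only, using the Logstash 2.x ruby-filter event API and touching only top-level fields) would be to rename anything with a leading underscore before it reaches ES:

filter {
  ruby {
    code => "
      event.to_hash.keys.each do |name|
        next unless name.start_with?('_')
        # copy the value to a name without the leading underscore(s), then drop the original
        event[name.sub(/^_+/, '')] = event[name]
        event.remove(name)
      end
    "
  }
}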


Yes, ES chooses how to map fields on its own. However, Logstash by default does provide an index template for logstash-* indexes with rules for the mapping that ES should apply, so it's not completely black and white. You can of course modify the index template so it fits your data.
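
If you do modify it, the elasticsearch output can be pointed at your own copy; a sketch (the template path here is only an example):

output {
  elasticsearch {
    hosts              => ["localhost:9200"]
    template           => "/etc/logstash/templates/logstash-custom.json"
    template_name      => "logstash"
    template_overwrite => true
  }
}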

The problem we have is that we don't know up front what the format of the inbound data is...

This should help with that

output {
  stdout { codec => rubydebug }
}

We don't know exactly how many log sources we have (a lot) or their format (mostly bespoke). So sending data to stdout is likely to lead to data overload, aside from the hit on throughput.

You only need to look at one log file (or better yet, one log event) using a file input. And you don't have to run this on your production Logstash node if performance is a concern (I don't imagine running one event through would have that significant an impact).
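
For example, a throwaway config like this (the path is hypothetical) shows exactly what Logstash would hand to ES for a single sample file:

input {
  file {
    path           => "/tmp/sample.log"
    start_position => "beginning"
    sincedb_path   => "/dev/null"   # don't remember read position between runs
  }
}
output {
  stdout { codec => rubydebug }
}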

P.S. If you don't know the log format... what are you planning on sending to ES? Just the raw message field? May we see your current Logstash config? Maybe it will make more sense.