I'm working with Elasticsearch 6.8.1 (and the same version across the rest of the Elastic Stack), using an index template for the indices Logstash sends events to. There are some date fields in the index template whose declared date formats aren't being honored (or I'm misunderstanding something, or just doing it plain wrong). The indices are created dynamically by plucking the year and month (yyyy-MM) off a date field and appending it to the index name, which the index template matches against.
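For context, the Logstash output is shaped roughly like the following (the index name here is a hypothetical stand-in for mine):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # the event's year and month get appended, e.g. some-logs-2019-07
    index => "some-logs-%{+yyyy-MM}"
  }
}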
I do have this index template asserting "dynamic": "strict" in the mappings, but I don't believe that should have any effect on date fields with multiple formats. And I haven't stumbled on any issues or documentation suggesting it causes problems beyond rejecting events that present fields the template doesn't know about.
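Roughly, the template's mapping section is shaped like this (6.x-style, with the type name shown and everything else elided):

"mappings": {
  "_doc": {
    "dynamic": "strict",
    "properties": {
      ...
    }
  }
}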
The date fields themselves all look like so:
"some_dtfield": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSSSSSS||yyyy-MM-dd HH:mm:ss"
}
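If it helps to reproduce outside of Logstash, indexing a document whose date value only has whole seconds, along these lines (index name again a stand-in for my monthly index), gets rejected with the same parse failure:

PUT some-logs-2019-07/_doc/1
{
  "some_dtfield": "2019-07-01 12:00:00"
}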
In my Logstash logs I see every event with date fields bound for the matching index template rejected with an error asserting the date field value is too short. I've tested this by dropping the existing index (it's monthly) and letting the workflow recreate it, to make sure I hadn't glossed over any index mapping changes that might've occurred. I also ran a quick query against the index mapping itself and saw the date fields exactly as I note above. I get the same output from Logstash about being unable to index after dropping the index and letting it recreate.
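(The mapping check was nothing fancier than the following, with my actual index name in place, and the format string came back exactly as defined in the template:

GET some-logs-2019-07/_mapping

)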
So this makes me consider two things:
Elasticsearch is doing some sort of short-circuit evaluation of the date formats it encounters when attempting to index: if the value doesn't match the leftmost format, it naively assumes it won't match anything defined after it (very unlikely).
I'm not defining multiple possible date formats correctly. As far as the documentation goes, for both current 7.x and 6.8, I haven't done anything obviously wrong or different; using double pipes to designate an OR condition is the right syntax. I'm not sure why it seems to be short-circuit evaluating things.
When I flip the expected date formats around from above, things index properly. However, I'm trying to future-proof the index template to anticipate microseconds.
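That is, with the shorter pattern first, the same events index fine:

"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd HH:mm:ss.SSSSSSSS"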
Does anything jump out at you all about what's going on here? I do believe the microsecond pattern itself is valid, since the Logstash errors stating "too short" strongly suggest Elasticsearch is actually expecting the microseconds to be there. Thanks in advance for any suggestions that crop up!
It might be matching against the first format it comes across and causing an issue because it thinks it's a match but then doesn't see the milliseconds. Is there any indication in the Logstash logs of why it's failing to parse?
See my original post. I state that Logstash outputs "too short" when it sends an event with the yyyy-MM-dd HH:mm:ss date format to Elasticsearch, where it is rejected.
As I mentioned in the OP as well, it seems like the || logic isn't being honored: some kind of short-circuit evaluation across the multiple potential date formats seems to be occurring when it shouldn't.
It seems more likely that it is matching the first format and then deciding the value is too short for that pattern, because the pattern says it should have micro/nanoseconds. That is much more likely than the OR operator not working.
At any rate, I'm not sure of the best way to fix that problem, but you could consider making your own grok pattern with a conditional for whether the .SSSSSSSS is needed, or adding a date filter with a match => pattern for each case (see the sketch below).
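A minimal, untested sketch of the date-filter approach, assuming the field name from the mapping above:

filter {
  date {
    # patterns are tried in order: fractional seconds first, then whole seconds
    match => ["some_dtfield", "yyyy-MM-dd HH:mm:ss.SSSSSSSS", "yyyy-MM-dd HH:mm:ss"]
    target => "some_dtfield"
  }
}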
It depends on whether you really need it to store the extra fractions of a second or not. You could always make it two separate fields.
You could also mutate the output before sending it to ES, padding with zeros as placeholders (one per S in the pattern) if you want to keep the fractions of seconds, but I think that could be ill-advised.
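If you did go that route, a rough sketch of the padding (the field name is assumed, and as noted, this may be ill-advised):

filter {
  # hypothetical: tack a zero fractional part onto values that lack one,
  # eight zeros to match the eight S digits in the mapping format
  if [some_dtfield] !~ /\./ {
    mutate {
      replace => { "some_dtfield" => "%{some_dtfield}.00000000" }
    }
  }
}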
That's actually something I had considered at some point, but the documentation gives the implicit suggestion that potential matching date formats can have varying time parts. Say, for example, a date field that expects "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss", which is logically a very similar situation to what I was asking about originally: there are multiple valid potential formats of different precision.
It's certainly possible to split the micros out into a separate integer field and then have whatever's querying that index rebuild the original microsecond time as needed (if the microseconds field is nonzero). It's cool if that's really the only viable solution, though I would've liked to have somewhat format-agnostic date fields.
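For illustration, the split version of the mapping might look like this (the micros field name is hypothetical):

"some_dtfield": {
  "type": "date",
  "format": "yyyy-MM-dd HH:mm:ss"
},
"some_dtfield_micros": {
  "type": "integer"
}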
Right, I saw those examples too, but I've noticed the pattern recognition sometimes comes down to whitespace. A main difference in the example you just provided is that there's a space and then the difference, not a difference at the end of the string as with the fractions of seconds. It would be good to get an Elastic team member to weigh in, because I've had to do workarounds for similar pattern-recognition issues before as well.
Definitely. I'm grateful your suggestion is along the same lines as something I'd considered before, but having a definitive official explanation to temper expectations would really put my mind at ease on the matter (and cement what my alternatives are from here on out).