Hello,
I seem to be having trouble with the way ES auto-determines whether a field
is a date.
We have massive sets of data, and while we try our hardest to scrub dates,
data with bad dates sometimes does get through.
Because our first record creates the indexing scheme, we see that ES
auto-detects the field as a date. For example:
curl -XPUT 'http://localhost:9200/date/person/1' -d '{
  "firstName": "Jason",
  "lastName": "Amster",
  "SID": "1",
  "loggins": [
    { "loggedInOn": "2010-10-04" }
  ]
}'
This produces the following schema: http://gist.github.com/612369, the
important part being:
<<< snip >>>
"loggins" : {
  "dynamic" : true,
  "enabled" : true,
  "date_formats" : [ "dateOptionalTime", "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd" ],
  "path" : "full",
  "properties" : {
    "loggedInOn" : {
      "omit_term_freq_and_positions" : true,
      "index_name" : "loggedInOn",
      "index" : "not_analyzed",
      "omit_norms" : true,
      "store" : "no",
      "boost" : 1.0,
      "format" : "dateOptionalTime",
      "precision_step" : 4,
      "term_vector" : "no",
      "type" : "date"
    }
  },
<<< end snip >>>
Then let's say another record comes through with a bad date:
curl -XPUT 'http://localhost:9200/date/person/2' -d '{
  "firstName": "Jason",
  "lastName": "Amster",
  "SID": "1",
  "loggins": [
    { "loggedInOn": "2010-10-32" }
  ]
}'
We get the following error:
{"error":"ReplicationShardOperationFailedException[[date][0] ]; nested:
MapperParsingException[Failed to parse [loggins.loggedInOn]]; nested:
IllegalFieldValueException[Cannot parse "2010-10-32": Value 32 for
dayOfMonth must be in the range [1,31]]; "}
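The only workaround I've found so far is to pre-create the index with an
explicit mapping so the field always stays a string (the exact syntax below
is my best reading of the put-mapping docs, so take it with a grain of salt):

```shell
# Create the index first, then push an explicit mapping so that
# loggedInOn is always indexed as a plain string, bypassing date
# detection. Syntax per my reading of the 0.x put-mapping API.
curl -XPUT 'http://localhost:9200/date'
curl -XPUT 'http://localhost:9200/date/person/_mapping' -d '{
  "person": {
    "properties": {
      "loggins": {
        "properties": {
          "loggedInOn": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}'
```

But that means giving up date range queries and sorting on the field
entirely, which isn't great either.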
I would love for it to know it's a date when it's good, but only treat it as
a string when it's not a good date... Or, worst case, just leave it as a
string no matter what, at least until we can come up with a better, more
robust date-scrubbing strategy?
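In the meantime, the kind of scrub we're considering on our side looks
roughly like this (Python; the format list mirrors the date_formats ES
generated above, with dateOptionalTime approximated as plain yyyy-MM-dd,
and the function name is ours, not anything from ES):

```python
from datetime import datetime

# Rough strftime equivalents of the formats in the generated mapping:
# "dateOptionalTime" (approximated as ISO yyyy-MM-dd) and
# "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd".
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d %H:%M:%S", "%Y/%m/%d"]

def scrub_date(value):
    """Return value unchanged if it parses as a known date format, else None."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return value
        except ValueError:
            continue
    # Drop the bad date rather than let it fail the whole document.
    return None

print(scrub_date("2010-10-04"))  # -> 2010-10-04
print(scrub_date("2010-10-32"))  # -> None (day 32 is out of range)
```

This keeps bad values out of ES, but it would obviously be nicer if ES
could do the fallback itself.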
Kind Regards,
Jason Amster