Date Detection not always wanted


(jamster) #1

Hello,

I seem to be having trouble with the way ES auto-determines a field's dateness.

We have massive sets of data, and although we try our hardest to scrub the dates,
records with bad dates sometimes do get through.

Because our first record creates the indexing scheme, we see that it
auto-detects the field as date, for example:

curl -XPUT 'http://localhost:9200/date/person/1' -d '{
  "firstName": "Jason",
  "lastName": "Amster",
  "SID": "1",
  "loggins": [
    {
      "loggedInOn": "2010-10-04"
    }
  ]
}'

This produces the following schema: http://gist.github.com/612369, important
part being:
<<< snip >>>
"loggins" : {
  "dynamic" : true,
  "enabled" : true,
  "date_formats" : [ "dateOptionalTime", "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd" ],
  "path" : "full",
  "properties" : {
    "loggedInOn" : {
      "omit_term_freq_and_positions" : true,
      "index_name" : "loggedInOn",
      "index" : "not_analyzed",
      "omit_norms" : true,
      "store" : "no",
      "boost" : 1.0,
      "format" : "dateOptionalTime",
      "precision_step" : 4,
      "term_vector" : "no",
      "type" : "date"
    }
  },
<<< end snip >>>

Then let's say another record comes through with a bad date:

curl -XPUT 'http://localhost:9200/date/person/2' -d '{
  "firstName": "Jason",
  "lastName": "Amster",
  "SID": "1",
  "loggins": [
    {
      "loggedInOn": "2010-10-32"
    }
  ]
}'

We get the following error:

{"error":"ReplicationShardOperationFailedException[[date][0] ]; nested:
MapperParsingException[Failed to parse [loggins.loggedInOn]]; nested:
IllegalFieldValueException[Cannot parse "2010-10-32": Value 32 for
dayOfMonth must be in the range [1,31]]; "}

I would love for it to treat the value as a date when it's valid, and only treat it
as a string when it's not... Or, worst case, just leave it as a
string no matter what, at least until we can come up with a better, more
robust date scrubbing strategy.

Kind Regards,
Jason Amster
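
The rejection shown above can be caught on the client before a document is sent. Here is a minimal Python sketch, assuming dates are expected in `yyyy-MM-dd` form. The helper name `is_valid_date` is hypothetical and not part of Elasticsearch, and `dateOptionalTime` accepts more forms than this check covers.

```python
from datetime import datetime

def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a real calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        # ValueError: not a real date (e.g. day 32); TypeError: value is not a string
        return False

print(is_valid_date("2010-10-04"))  # the date from record 1
print(is_valid_date("2010-10-32"))  # the date from record 2, day 32 does not exist
```

Records that fail the check could be held back or cleaned before indexing; what to do with them is a design decision this sketch leaves open.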


(Shay Banon) #2

Hi,

If you don't want it treated as a date, you can either explicitly set it
in the mapping to be a string, or simply set the date_format to "none". But
note that searching on it will then be limited to the lexical
representation of that date; it will not behave as an actual date.

-shay.banon
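
A minimal sketch of the first option (an explicit string mapping), assuming the `person` type and the `loggins.loggedInOn` field from the original post, and the `{type: {"properties": ...}}` body shape that the auto-generated schema suggests. The helper `string_mapping` is hypothetical; it only builds the JSON body and does not contact the server.

```python
import json

def string_mapping(doc_type, field_path):
    """Build a mapping body that declares a (possibly nested) field as a string."""
    node = {"type": "string"}
    # Wrap from the innermost field outwards: each level sits under "properties".
    for name in reversed(field_path.split(".")):
        node = {"properties": {name: node}}
    return {doc_type: node}

body = string_mapping("person", "loggins.loggedInOn")
print(json.dumps(body, indent=2))
```

The printed JSON is meant as the `-d` payload of a mapping request for the `person` type. The exact endpoint and accepted options depend on the Elasticsearch version in use.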


(jamster) #3

Yes, for now that is acceptable... We'll work to ensure our dates are clean
going forward, but for now we need just temp strings anyway.

But, trying to use the REST API to change the mapping, I have been
encountering errors. In just using the samples given on the site, I
modified the mapping to go from date to string:

jamster@jamster:~$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user" : "kimchy",
  "postDate" : "2009-11-15T14:12:12",
  "message" : "trying out Elastic Search"
}'

{"ok":true,"_index":"twitter","_type":"tweet","_id":"1"}

jamster@jamster:~$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
  "tweet" : {
    "properties" : {
      "postDate" : {"type" : "string"}
    }
  }
}'

{"error":"Merge failed with failures {[mapper [postDate] of different type,
current_type [date], merged_type [string]]}"}

So, I put in the ignore_conflicts flag:

jamster@jamster:~$ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping?ignore_conflicts=true' -d '{
  "tweet" : {
    "properties" : {
      "postDate" : {"type" : "string"}
    }
  }
}'

{"ok":true,"acknowledged":true}

But, it still remains as a date:
curl -s -XGET 'http://localhost:9200/_cluster/state?pretty=true'

<<< snip (http://gist.github.com/613450) >>>
"postDate" : {
  "omit_term_freq_and_positions" : true,
  "index_name" : "postDate",
  "index" : "not_analyzed",
  "omit_norms" : true,
  "store" : "no",
  "boost" : 1.0,
  "format" : "dateOptionalTime",
  "precision_step" : 4,
  "term_vector" : "no",
  "type" : "date"
},
<<< end snip >>>

How do I force it to change?


(Shay Banon) #4

You can't force a change of type for a field that has already been mapped.
You will need to reindex and declare the mapping before indexing the data.


(jamster) #5

Okay, but if the index does not exist, you can't force a mapping on it...

curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping?ignore_conflicts=true' -d '{
  "tweet" : {
    "properties" : {
      "postDate" : {"type" : "string"}
    }
  }
}'
{"error":"[twitter] missing"}

I can delete the index, but how do I create it with a mapping without
first giving it a document? Every document contains a date, so the field
will be auto-detected. I could pass in a simple document with one
field just to get the index created, but that seems hackish... Is there any
cleaner way?

-Jason


(Shay Banon) #6

You can create the index first using curl -XPUT localhost:9200/twitter.
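
Put together, the order discussed in this thread is: create the empty index, declare the mapping, and only then index documents. The following is a minimal Python sketch that lists those requests in order, assuming a node at `localhost:9200` and the `twitter`/`tweet` names used above. `plan_requests` and `send` are hypothetical helpers, and `send` is defined but never called here, so nothing is sent.

```python
import json
import urllib.request

BASE = "http://localhost:9200"  # assumption: local node, as in the thread

def plan_requests(index, doc_type, mapping, doc_id, doc):
    """Return (method, url, body) tuples in the order they should be sent."""
    return [
        ("PUT", f"{BASE}/{index}", None),                         # 1. create the empty index
        ("PUT", f"{BASE}/{index}/{doc_type}/_mapping", mapping),  # 2. declare the mapping
        ("PUT", f"{BASE}/{index}/{doc_type}/{doc_id}", doc),      # 3. only now index data
    ]

def send(method, url, body=None):
    """Send one request and return the response text (needs a running node)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

mapping = {"tweet": {"properties": {"postDate": {"type": "string"}}}}
doc = {"user": "kimchy", "postDate": "2009-11-15T14:12:12",
       "message": "trying out Elastic Search"}
steps = plan_requests("twitter", "tweet", mapping, 1, doc)
for method, url, _ in steps:
    print(method, url)
```

If an index with the auto-detected date mapping already exists, it has to be deleted and its data reindexed first, since the type of an already-mapped field cannot be changed.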


(jamster) #7

Perfect... This works great. Thanks!!!

