JSON mapping limitations, was: Getting MapperParsingException while parsing a string and a number?


(Jean-Sebastien Delfino) #1

Thanks!

Comments and more questions inline.

On Tue, Apr 10, 2012 at 3:29 PM, Igor Motov imotov@gmail.com wrote:

By default, elasticsearch tries to deduce field types from field values.
If you check the mappings after the first PUT request, you will see
something like this:

$ curl 'http://localhost:9200/test/_mapping?pretty=true'
{
  "test" : {
    "tweet" : {
      "properties" : {
        "content" : {
          "dynamic" : "true",
          "properties" : {
            "text" : { "type" : "string" },
            "title" : { "type" : "string" }
          }
        },
        "contentType" : { "type" : "string" },
        "user" : { "type" : "string" }
      }
    }
  }
}

As you can see from this mapping, elasticsearch is now treating content as
an object type field (
http://www.elasticsearch.org/guide/reference/mapping/object-type.html)
and it will expect to see objects in this field until the index is
deleted. If you had run the second request first, it would have expected to
see strings there and failed on the object.
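The behaviour described above can be illustrated with a toy model. This is only a sketch of the "first value fixes the field type" rule, not elasticsearch code; the names `ToyIndex` and `json_kind` are made up for illustration.

```python
# Toy model only: mimics "the first value fixes the field type".
# It is not how elasticsearch is implemented.

def json_kind(value):
    """Return a coarse JSON kind for a Python value (lists and null not handled)."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


class ToyIndex:
    def __init__(self):
        self.mapping = {}  # field name -> kind fixed by the first document seen

    def put(self, doc):
        # Validate every field first so a rejected document leaves the mapping untouched.
        for field, value in doc.items():
            kind = json_kind(value)
            known = self.mapping.get(field, kind)
            if known != kind:
                raise ValueError(f"field {field!r} is mapped as {known}, got {kind}")
        # Only then record the kinds of fields that were not mapped yet.
        for field, value in doc.items():
            self.mapping.setdefault(field, json_kind(value))


index = ToyIndex()
index.put({"user": "jane doe", "content": {"title": "some news"}})
try:
    index.put({"user": "john doe", "content": "http://example.com/foo"})
except ValueError as err:
    print(err)
```

Swapping the order of the two `put` calls makes the object document the one that is rejected, which matches the description above.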

OK I understand the limitation now. The doc you referenced makes clear that
'once a field has been added, its type can not change'.

There are a couple of ways around it. You can use different field names for
different types:

curl -XPUT 'http://localhost:9200/test/tweet/1' -d '{ "user": "jane doe",
"contentType": "article", "content_article": { "title": "some news",
"text": "blah blah" } }'
curl -XPUT 'http://localhost:9200/test/tweet/2' -d '{ "user": "john doe",
"contentType": "url", "content_url": "http://example.com/foo" }'
curl -XPUT 'http://localhost:9200/test/tweet/3' -d '{ "user": "john doe",
"contentType": "number", "content_number": 123 }'

Or you can assign different elasticsearch types to records with different
content types, (different elasticsearch types can have different mappings).

curl -XPUT 'http://localhost:9200/test/article/1' -d '{ "user": "jane doe",
"contentType": "article", "content": { "title": "some news",
"text": "blah blah" } }'
curl -XPUT 'http://localhost:9200/test/url/2' -d '{ "user": "john doe",
"contentType": "url", "content": "http://example.com/foo" }'
curl -XPUT 'http://localhost:9200/test/number/3' -d '{ "user": "john doe",
"contentType": "number", "content": 123 }'
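Both workarounds can be applied on the client before a document is sent. The following is a minimal sketch, assuming every document carries a `contentType` field as in the examples above; the helper names `rename_content` and `index_url`, and the defaults for host, index and fallback type, are hypothetical.

```python
import json
from urllib.parse import quote


def rename_content(doc, field="content", type_field="contentType"):
    """Workaround 1: return a copy with `content` renamed to `content_<contentType>`."""
    out = dict(doc)
    if field in out and type_field in out:
        out[f"{field}_{out[type_field]}"] = out.pop(field)
    return out


def index_url(doc, doc_id, base="http://localhost:9200", index="test",
              default_type="tweet"):
    """Workaround 2: use the contentType as the elasticsearch type in the URL."""
    es_type = str(doc.get("contentType", default_type))
    return f"{base}/{quote(index, safe='')}/{quote(es_type, safe='')}/{quote(str(doc_id), safe='')}"


doc = {"user": "john doe", "contentType": "url", "content": "http://example.com/foo"}
print(json.dumps(rename_content(doc)))
print(index_url(doc, 2))
```

Either helper keeps each mapped field tied to a single value type; which one fits depends on whether queries need to span all content types in one elasticsearch type.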

Unfortunately, I can't always predict the schema of the documents to index.
Some follow a fixed schema which will never change; some follow a fixed
schema with fields that can take values of different types (JavaScript and
JSON allow that, as do other languages such as Java through inheritance);
others follow schemas that evolve over time.

I think my use case is a pretty common use case these days: semi-structured
JSON documents with open or evolving schemas.

Also, after a few more tests, I bumped into another serious problem with
arrays. For example a single PUT like this:
curl -XPUT 'http://localhost:9200/test3/mixedarray/1' -d '{ "array":
[123, "http://www.example.com/whatever"] }'
fails with:
{"error":"MapperParsingException[Failed to parse [array]]; nested:
NumberFormatException[For input string: "http://www.example.com/whatever"];
","status":400}

My documents are simple valid JSON, so I'm surprised to bump into these
schema mapping problems after reading on the project home page that
elasticsearch was 'schema-free & document oriented' :)
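For the mixed array case, one possible client-side workaround is to split the array by value kind before indexing, so that each resulting field only ever holds one type. This is a sketch under that assumption; `split_by_kind` and the `array_<kind>` field names are hypothetical, and the relative order of items of different kinds is lost.

```python
def split_by_kind(values):
    """Group array items by JSON kind so each group holds a single type."""
    groups = {}
    for value in values:
        if isinstance(value, bool):  # checked before int: bool is a subclass of int
            kind = "boolean"
        elif isinstance(value, (int, float)):
            kind = "number"
        elif isinstance(value, str):
            kind = "string"
        else:
            kind = "other"  # objects, nested arrays and null are left to the caller
        groups.setdefault(kind, []).append(value)
    return groups


doc = {"array": [123, "http://www.example.com/whatever"]}
flat = {f"array_{kind}": items for kind, items in split_by_kind(doc["array"]).items()}
print(flat)
```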

Any thoughts on how to fix this? I'd be happy to help and contribute a
patch if you give me a few pointers and some initial ideas on how to
approach this.

Thanks!

  • Jean-Sebastien

(Shay Banon) #2

The type of the field controls how it's going to be indexed, and how later
it's going to be queried. It can't really be changed...


(Jean-Sebastien Delfino) #3

On Friday, April 13, 2012 5:33:19 AM UTC-7, kimchy wrote:

The type of the field controls how its going to be indexed, and how later
its going to be queried. It can't really be changed...

I understand that this is a limitation of the current code.

It's a serious limitation IMHO as it prevents elasticsearch from handling:
a) JSON objects with fields containing values of different types;
b) JSON objects with inheritance or object augmentation [1][2], likely to
introduce fields with the same name containing values of different types in
the object tree;
c) JSON arrays containing values of different types (either core types or
unrelated objects introducing fields with the same name but containing
values of different types).

You realize that with (c), elasticsearch can't even handle a simple array
like [ 1, "abc" ], right?

Are you saying that the current code cannot be improved at all to handle
these basic issues?

I was volunteering to help and asking for some ideas and suggestions to
approach this work, hoping that would be fixable...

Are you saying that it's just impossible? Why can't the design evolve a bit
to make type handling more flexible?

[1] http://javascript.crockford.com/prototypal.html
[2] http://www.crockford.com/javascript/inheritance.html

Thanks

  • Jean-Sebastien


(Shay Banon) #4

At the end of the day, you can index everything as strings, and not let the
dynamic mapping try to derive a numeric type. For example, "field"
: [1, "test"] will work fine if "field" is a string. Obviously, in this
case, you will lose the option to treat numeric fields as numeric; for
example, range queries will treat them as strings.
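A minimal sketch of that approach, assuming the conversion is done on the client before indexing (the helper name `stringify` is hypothetical). The last line shows the cost mentioned above: compared as strings, "10" sorts before "9".

```python
def stringify(value):
    """Recursively turn every scalar leaf into a string; keep objects, arrays and null."""
    if isinstance(value, dict):
        return {key: stringify(item) for key, item in value.items()}
    if isinstance(value, list):
        return [stringify(item) for item in value]
    if isinstance(value, bool):
        return "true" if value else "false"  # JSON spelling rather than Python's "True"
    if value is None:
        return None
    return str(value)


print(stringify({"field": [1, "test"]}))  # {'field': ['1', 'test']}
print("10" < "9")  # True: string comparison is lexicographic
```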


(anemitz) #5

Is there a way to then treat all sub fields as strings?

For instance if I have a field, 'custom', which is defined as type: object, path: full -- is there a way to say all fields indexed under the custom object should be treated as strings?
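The thread does not answer this question. One possibility, sketched here under the assumption that the elasticsearch version in use supports `dynamic_templates` with `path_match` on the root object mapping, is a template that maps every new field under `custom` as a string. The helper name `strings_under` and the template name are hypothetical, and sub-objects nested under `custom` would need separate handling.

```python
import json


def strings_under(path, es_type):
    """Build a mapping body whose dynamic template maps new fields under `path` as strings."""
    return {
        es_type: {
            "dynamic_templates": [
                {
                    f"{path}_as_string": {  # template name, chosen for illustration
                        "path_match": f"{path}.*",
                        "mapping": {"type": "string"},
                    }
                }
            ]
        }
    }


# The printed JSON is meant to be sent as the body of a put-mapping request.
print(json.dumps(strings_under("custom", es_type="tweet"), indent=2))
```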

