Can't specify a subobject field in a path as the _id field

I was designing an ES document and thought I'd use a _id : { path : {
... } } specification to pick a value I thought was really unique.

I am confident the values are all unique: each one is a SHA-256 hash of
the original file, and our previous system used it to identify
duplicate files.
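
For reference, a hash of this kind could be computed as sketched below. This is only an illustration in Python, not the code from the previous system: the function name file_sha256, the chunk size, and the choice of a hex-encoded digest over the raw file bytes are all assumptions.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex-encoded SHA-256 digest of the file at `path`.

    The file is read in chunks so large files need not fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents get the same digest, which is what makes the value usable for spotting duplicates.
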
I decided to index the hash field within an object, so the definition
looks like the one below. But this doesn't work:

"myDocument": {
  "_id": { "path": "Text.hash" },
  "properties": {
    "originalPath": { "type": "string", "store": 1, "index": "not_analyzed" },
    ...
    "Text": {
      "properties": {
        "hash": { "type": "string", "store": 1, "index": "not_analyzed" },
        "length": { "type": "long", "store": 1 },
        ...
      }
    }
  }
}

When I use a field inside the Text object as the _id path, as above,
the index ends up with just 1 document in it, and its _id is "[":
http://localhost:9200/_search?pretty=1

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "dragon0",
      "_type" : "myDocument",
      "_id" : "[",
      "_score" : 1.0
    } ]
  }
}

If I move the hash field out of the Text subobject, I get unique
documents. So this does work:

"myDocument": {
  "_id": { "path": "hash" },
  "properties": {
    "originalPath": { "type": "string", "store": 1, "index": "not_analyzed" },
    "hash": { "type": "string", "store": 1, "index": "not_analyzed" },
    ...
    "Text": {
      "properties": {
        ...
        "length": { "type": "long", "store": 1 },
        ...
      }
    }
  }
}

If I again search for everything with
http://localhost:9200/_search?pretty=1
I get all 200 documents I inserted:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 200,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "dragon0",
      "_type" : "myDocument",
      "_id" : "FjhwleNbQhyDh4XzB7k7BA",
      "_score" : 1.0
    }, {
      "_index" : "dragon0",
      "_type" : "myDocument",
      "_id" : "OVqLwly7RKmQ6R59q8JQOA",
      "_score" : 1.0
    }, {
    ...

I think this is a bug. If not please
Is this enough information to recreate the problem? I didn't have time
to gist it today.
-Paul

--

Hi Paul

> I think this is a bug. If not please
> Is this enough information to recreate the problem? I didn't have time
> to gist it today.

It's quite difficult to follow when the JSON isn't formatted properly
etc. Please can you gist a curl recreation that we can copy and paste
to run, so that we can examine it locally?

ta

clint


--

Hi Paul

Actually, it looks like your _id path may be pointing to an array? I
would say that you are doing something illegal, but that ES should catch
the problem and throw an error.

Opened as issue #2275, "The `_id` path should not allow arrays", in
elastic/elasticsearch on GitHub.

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "mappings" : {
    "test" : {
      "_id" : {
        "path" : "foo.bar"
      }
    }
  }
}
'

THIS WORKS

curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1' -d '
{
  "foo" : {
    "bar" : 1,
    "baz" : "xx"
  }
}
'

[Fri Sep 21 11:14:21 2012] Response:

{
  "ok" : true,
  "_index" : "test",
  "_id" : "1",
  "_type" : "test",
  "_version" : 1
}

THIS DOESN'T

curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1' -d '
{
  "foo" : {
    "bar" : [
      2
    ],
    "baz" : "xx"
  }
}
'

[Fri Sep 21 11:14:23 2012] Response:

{
  "ok" : true,
  "_index" : "test",
  "_id" : "[",
  "_type" : "test",
  "_version" : 1
}
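
Until the server rejects such documents itself, a client could check the value at the _id path before indexing. A minimal sketch in Python, assuming documents are plain dicts and field names contain no dots; resolve_id_path is a hypothetical helper, not part of Elasticsearch or any client library:

```python
def resolve_id_path(doc, path):
    """Follow a dotted path such as "foo.bar" through nested dicts and
    return the value found there as a string usable as a document ID.

    Raises ValueError if the path is missing or does not end at a scalar.
    """
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise ValueError("path %r not found in document" % path)
        value = value[key]
    # Arrays, objects and nulls cannot serve as a single ID value.
    if value is None or isinstance(value, (dict, list)):
        raise ValueError(
            "path %r must point at a single scalar value, got %s"
            % (path, type(value).__name__)
        )
    return str(value)


if __name__ == "__main__":
    # The two documents from the recreation above.
    print(resolve_id_path({"foo": {"bar": 1, "baz": "xx"}}, "foo.bar"))
    try:
        resolve_id_path({"foo": {"bar": [2], "baz": "xx"}}, "foo.bar")
    except ValueError as err:
        print("rejected:", err)
```

With this check the first document resolves to an ID, while the second is rejected on the client instead of being silently stored under "[".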

--

After looking at this more, my assumption now is that the real rule may
be that the _id path should point at a numeric value ready for direct
use as an ID.
I think you are right: it should complain about my naive but illegal
choice of field to map to the ID, instead of silently taking whatever
alternative path it took.

-Paul

On 9/21/2012 2:19 AM, Clinton Gormley wrote:

> Actually, it looks like your _id path may be pointing to an array? I
> would say that you are doing something illegal, but that ES should catch
> the problem and throw an error.

--

Follow-up to this post, so as not to throw anyone else off: you CAN use
a non-numeric field as the _id, just not one nested within another
(sub)object at this time. So the fix for the bug is either to complain
or to use the field properly.
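
One possible workaround, not proposed in the thread itself, is to skip the _id path mapping and put the hash into the index request URL, since Elasticsearch accepts an explicit ID as PUT /{index}/{type}/{id}. A sketch in Python using only the standard library, assuming the hash sits at Text.hash as in the original mapping; the function name, the default base URL, and the sample hash value are placeholders:

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_index_request(doc, index, doc_type, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that indexes `doc` under
    an explicit ID taken from doc["Text"]["hash"]."""
    doc_id = doc["Text"]["hash"]
    if not isinstance(doc_id, str) or not doc_id:
        raise ValueError("Text.hash must be a non-empty string")
    # Percent-encode each path segment so unusual characters stay intact.
    url = "%s/%s/%s/%s" % (
        base_url,
        quote(index, safe=""),
        quote(doc_type, safe=""),
        quote(doc_id, safe=""),
    )
    body = json.dumps(doc).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # "abc123" stands in for a real SHA-256 digest.
    request = build_index_request(
        {"originalPath": "/tmp/a.txt", "Text": {"hash": "abc123", "length": 5}},
        "dragon0",
        "myDocument",
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```

Because the ID is chosen by the client, re-indexing the same file overwrites the existing document instead of creating a duplicate.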

-Paul

On 9/21/2012 10:11 AM, P. Hill wrote:

> After looking at this more, my assumption now is that the real rule
> may be that the _id path should point at a numeric value ready for
> direct use as an ID.
> I think you are right: it should complain about my naive but illegal
> choice of field to map to the ID, instead of silently taking whatever
> alternative path it took.

--