I was designing an ES document and thought I'd use an _id : { path : {
... } } specification to pick a value that I know is unique.
I am confident that the values are all unique: each one is a SHA-256
hash of the original file. We used this in our previous system to
identify duplicate files.
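For context, here is a minimal sketch of how such a hash could be computed, assuming it is the hex SHA-256 digest of the raw file bytes. The function name file_sha256 and the chunk size are illustrative and not taken from the actual system.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: the original system's hashing code is not shown,
    so this only illustrates the general idea.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes produce the same digest, which is what makes the value usable for duplicate detection.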
I decided to index the hash field within an object. The definition
below does not work:
"myDocument": {
_id : { path : "Text.hash" }
,properties: {
"originalPath": { type:"string", store:1,
index:"not_analyzed" }
...
,Text : {
properties: {
hash: { type:"string", store:1, index:"not_analyzed" }
,length: { type:"long", store:1 }
...
}
}
}
}
When I specify a field inside the Text object, as above, the index ends
up with just 1 document in it, and its _id is "[".
http://localhost:9200/_search?pretty=1
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "dragon0",
"_type" : "myDocument",
"_id" : "[",
"_score" : 1.0
} ]
}
}
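For comparison, this is roughly how a dotted path such as "Text.hash" would be expected to resolve against a nested document. It is only a sketch of the expected behaviour, not Elasticsearch's actual implementation. The helper name resolve_path and the sample field values are made up for illustration.

```python
def resolve_path(document, dotted_path):
    """Walk a nested dict following a dot-separated path.

    Illustrative only: shows the lookup one would expect for
    _id : { path : "Text.hash" }, not how ES resolves it internally.
    """
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(dotted_path)
        current = current[key]
    return current


# Hypothetical document shaped like the mapping above.
sample = {
    "originalPath": "/data/a.txt",
    "Text": {"hash": "ab12", "length": 42},
}
expected_id = resolve_path(sample, "Text.hash")
```

With this lookup, each document would get its own nested hash as its _id, rather than every document collapsing onto the single _id "[".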
If I move the hash field out of the Text sub-object, I get unique
documents. So this does work:
"myDocument": {
_id : { path : "hash" }
,properties: {
"originalPath": { type:"string", store:1,
index:"not_analyzed" }
,hash: { type:"string", store:1, index:"not_analyzed" }
...
,Text : {
properties: {
...
,length: { type:"long", store:1 }
...
}
}
}
}
If I search for everything again with
http://localhost:9200/_search?pretty=1
I get all 200 documents I inserted:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 200,
"max_score" : 1.0,
"hits" : [ {
"_index" : "dragon0",
"_type" : "myDocument",
"_id" : "FjhwleNbQhyDh4XzB7k7BA",
"_score" : 1.0
}, {
"_index" : "dragon0",
"_type" : "myDocument",
"_id" : "OVqLwly7RKmQ6R59q8JQOA",
"_score" : 1.0
}, {
...
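Until the nested path works, one possible client-side workaround is to move the hash to the top level before indexing, matching the second mapping. This is a minimal sketch that assumes documents are built as plain dicts. The function name hoist_hash is hypothetical.

```python
def hoist_hash(document):
    """Return a copy of `document` with Text.hash moved to a top-level hash.

    Hypothetical helper: assumes the document has a "Text" object that
    contains a "hash" key, as in the first mapping.
    """
    text = dict(document["Text"])   # copy so the caller's dict is untouched
    hoisted = dict(document)
    hoisted["hash"] = text.pop("hash")
    hoisted["Text"] = text
    return hoisted
```

The result can then be indexed against the mapping that uses _id : { path : "hash" }.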
I think this is a bug. If not, please let me know.
Is this enough information to recreate the problem? I didn't have time
to gist it today.
-Paul