I have been playing with ElasticSearch (wow! great stuff, guys).
I was working on defining a mapping for documents I already have in
another Lucene index.
The old index has a field called "path":
"path" : { "type":"string", "omit_term_freq_positions":false,
"index":"not_analyzed", "store":"yes", "term_vector":"no",
"omit_norms":"false" },
It is not only the actual file system path of the source for the
Lucene document (aka the equivalent of an _attachment in ElasticSearch),
but also the unique ID for the document.
QUESTION 1:
Is it OK to path the _id to the "path" property?
"_id": { "path":"path"}, <-- two uses of the word path
Or should I add an "index_name" to "path"?
"path" : { "index_name":"filepath", "type":"string", ... },
So here is the mapping assuming the field/property "path" is OK as is.
{
  "myDocument" : {
    "_id": { "path":"path"},
    ...
    "properties" : {
      ...
      "path" : { "type":"string", ...},
      ...
    }
  }
}
Here it is with a different internal name.
{
  "myDocument" : {
    "_id": { "path":"filepath"},
    ...
    "properties" : {
      ...
      "path" : { "index_name":"filepath", "type":"string", ...},
      ...
    }
  }
}
QUESTION 2:
Is this a correct way to point _id at what comes in a document as "path",
but is now in the index as "filepath"?
QUESTION 3:
If I do provide an "index_name":"filepath", then when I use such a field in
another operation, for example a search, do I refer to the field as
"path" or "filepath"?
On Wed, 2012-07-18 at 15:21 -0700, P. Hill wrote:
It is not only an actual file system path that was the source for the
Lucene document (aka the equivalent of an _attachment in
Elasticsearch), but it is the unique ID for the document.
QUESTION 1:
Is it OK to path the _id to the "path" property?
"_id": { "path":"path"}, <-- two uses of the word path
It is OK.
Or should I add an "index_name" to "path"?
"path" : { "index_name":"filepath", "type":"string", ... },
You can do that too.
QUESTION 2:
Is this a correct way to point _id at what comes in a document as
"path", but is now in the index as "filepath"?
No. This doesn't work. You have to use the "real" field name.
QUESTION 3:
If I do provide an "index_name":"filepath", then when I use such a field in
another operation, for example a search, do I refer to the field
as "path" or "filepath"?
Have a look at gist:3141520 · GitHub which should demonstrate
the answers to your questions.
Sweet! Thanks for the examples.
In fact you went further and also named your document "path" instead of
"myExample".
Before I saw this I was confused by your example of "path.path". That
seems excessive to call the entry "path" even though it shows that there
is no collision in the use of the word in any of the locations.
I also learned that my example was wrong: properties are NESTED in "_id"
and don't come after it like the definitions for "_all" or "_source". This
is not obvious from the web page about the id field.
You ended with
You can use: path, path.path, or filepath
Which I believe was in answer to my Question 3:
"If I do provide an "index_name":"filepath" ... [in] a search, do I then refer to the field as "path" or "filepath"?"
thus:
"path" refers to the mapping document property named "path".
"path.path" is a longer version of the same, because the mapping defined a document type called "path" and it has a field called "path".
In my case it would be "myDocument.path".
"filepath" is the name in the index, so that is what I'd see if I brought up the index directly in Luke (?)
But I'm confused by your earlier comment where you said:
No. This doesn't work. You have to use the "real" field name.
This seems to conflict with using "filepath" mentioned above.
I don't know what you mean by "real" field name nor to which "This" refers.
Since your GET _search examples on GitHub using "filepath" returned nothing,
I'm thinking you misspoke and the first line of yours which I quoted above should be:
*** You can use: path, or path.path, but the index_name filepath is not useful in either a mapping or in a query. ***
This would then make that line agree with the "No. This doesn't work" and the actual example on GitHub.
Did I miss something?
Thanks for the great examples; I now know how to define an "_id" and understand the use of names a bit more.
Have a look at gist:3141520 · GitHub which should demonstrate
the answers to your questions.
Sweet! Thanks for the examples.
I've just taken another look at my gist, and I seem to have misled you
with bad nesting. Apologies. I've posted a new gist: gist:3152797 · GitHub
Before I saw this I was confused by your example of "path.path". That
seems excessive to call the entry "path" even though it shows that there
is no collision in the use of the word in any of the locations.
I also learned that my example was wrong: properties are NESTED in "_id"
and don't come after it like the definitions for "_all" or "_source". This
is not obvious from the web page about the id field.
This was where my nesting error was. properties DO come after _id.
You can use: path, path.path, or filepath
Turns out that, with the right mapping, you can also use path.filepath.
But I'm confused by your earlier comment where you said:
No. This doesn't work. You have to use the "real" field name.
This seems to conflict with using "filepath" mentioned above.
I don't know what you mean by "real" field name nor to which "This" refers.
This wasn't terribly clear. What I meant was: when you define the path
for the _id field in the mapping, you have to use the real name (i.e.
'path'), not the index_name (i.e. 'filepath').
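To make the distinction concrete, here is a minimal Python sketch of the mapping shape discussed in this thread, with a check that the "_id" path names a mapping property rather than an index_name. The type name "myDocument" comes from the original question; the helper id_path_is_real_field is hypothetical and not part of Elasticsearch, and only the field options quoted in the thread are shown.

```python
import json

# Mapping shape from this thread: "_id" and "properties" are siblings.
# Only options quoted in the thread are shown; others are omitted.
mapping = {
    "myDocument": {
        "_id": {"path": "path"},  # the real field name, not the index_name
        "properties": {
            "path": {
                "index_name": "filepath",
                "type": "string",
                "index": "not_analyzed",
                "store": "yes",
            }
        },
    }
}


def id_path_is_real_field(mapping, doc_type):
    """Hypothetical helper: True if _id.path names a key under properties."""
    type_mapping = mapping[doc_type]
    id_path = type_mapping["_id"]["path"]
    return id_path in type_mapping["properties"]


if __name__ == "__main__":
    print(json.dumps(mapping, indent=2))
    print(id_path_is_real_field(mapping, "myDocument"))
```

The check only compares names locally; it does not ask a running cluster whether the mapping is accepted.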
I've just taken another look at my gist, and I seem to have misled you
with bad nesting. Apologies. I've posted a new gist: gist:3152797 · GitHub
Thanks again.
Wow, but but but, the old mapping (and queries) when using path worked!
Yikes, I don't understand why at all. I'm not sure if that is a bug or
a feature!
OK, so what we've learned so far:
1. Despite the use of the word "path" in various definitions, "path"
works as the name of a document mapping or as a field name.
2. In the mapping, when you want the _id "path" to point to a field, you must use the
real field name, not any index_name.
3. When querying, you may use any of them: "path.path", "path.filepath"
(the index_name), or "path".
4. The _id definition comes before (is a sibling of) the properties
definition.
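As a rough illustration of the querying point, the following Python sketch builds one term query body per field name that this thread reports as usable. It assumes, as in the gist discussed here, a document type named "path" with a property "path" whose index_name is "filepath"; the helper term_query and the sample value are made up for illustration.

```python
import json

# Field names the thread reports as usable in queries, given a document type
# named "path" with a property "path" whose index_name is "filepath".
FIELD_NAMES = ["path", "path.path", "path.filepath", "filepath"]


def term_query(field, value):
    """Build a term query body for an exact match on a not_analyzed field."""
    return {"query": {"term": {field: value}}}


if __name__ == "__main__":
    for field in FIELD_NAMES:
        # "/tmp/example.txt" is a made-up value for illustration.
        print(json.dumps(term_query(field, "/tmp/example.txt")))
```

Each printed body could be sent to a _search endpoint to compare which names return hits against your own mapping.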
Related to point #1 above, just last week I botched the edit of a curl
command and ended up creating an entry with the ID of
_search via a bad URL, something like:
This leads to another question related to the original idea of using path as
an ID.
Do I want to use path as an ID? It is kind of hard to end a URL with an
entire path, is it not? Is it not going to confuse the
URL parser on the server if there are more slashes? What about backslashes,
if the original file was a Windows file? That might mess up a client
command parser where I'm using curl (Cygwin or an actual Linux command
window) or the server-side interpretation. Maybe that is too much of a
can of worms.
Maybe I should forget about using path as an ID, keep path as a known
unique key in my document, but also let ES assign a shorter and easier
to use _id,
just in case I do need to paste the _id into a manual curl command, a
script, etc.
Wow, but but but, the old mapping (and queries) when using path worked!
Yikes, I don't understand why at all. I'm not sure if that is a bug or
a feature!
Yeah, I consider it a fault that ES isn't more vocal about constructs
that it doesn't understand.
Related to point #1 above, just last week I botched the edit of a curl
command and ended up creating an entry with the ID of
_search via a bad URL, something like:
That, I'd say, is a feature. ES did exactly what you asked it to do.
There's no reason you can't have an ID called "_search".
This leads to another question related to the original idea of using path as
an ID.
Do I want to use path as an ID? It is kind of hard to end a URL with an
entire path, is it not? Is it not going to confuse the
URL parser on the server if there are more slashes? What about backslashes,
Your path-as-id should be URI encoded, then it won't be an issue.
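A minimal sketch of that encoding step in Python, using the standard library's urllib.parse.quote with safe="" so that slashes and backslashes are percent-encoded too. The host, index name, and type name below are placeholders I made up, not values from this thread, and how any proxy in front of the server treats encoded slashes is not covered here.

```python
from urllib.parse import quote

# Hypothetical endpoint pieces; adjust to your own cluster, index, and type.
BASE_URL = "http://localhost:9200"
INDEX = "myindex"
DOC_TYPE = "myDocument"


def encode_id(path):
    """Percent-encode every reserved character, including '/' and '\\'."""
    return quote(path, safe="")


def document_url(path):
    """Build the document URL with the encoded path as the final segment."""
    return f"{BASE_URL}/{INDEX}/{DOC_TYPE}/{encode_id(path)}"


if __name__ == "__main__":
    print(document_url("/home/user/docs/report 1.txt"))
    print(document_url("C:\\docs\\report.txt"))
```

Passing safe="" matters because quote leaves "/" unencoded by default, which is exactly the character that would otherwise split the ID into extra URL segments.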
Maybe I should ignore the use of path as an ID, keep path as a known
unique key in my document, but also let ES assign a shorter and easier
to use _id
You can't make any field unique except for the ID, so if you want to
enforce uniqueness, the ID is your only option.