To add to what Drew has said:
I can't wrap my head around the difference between the field attributes
"enabled," "index," and "store." How is setting a field to { "enabled" :
false }, different than setting it to { "index" : "no" }? What part does
"store" play in all this? Is { "store" : "yes" } the same as { "index" :
"not_analyzed" }?
{enabled: false} is different from { index: no}.
String fields accept: { index: no|not_analyzed|analyzed}:
- no: don't index the string
- not_analyzed: index the string exactly as passed in
- analyzed: first analyze the string, then index the resulting
tokens
Other scalar values, e.g. number, date, etc., accept: {index: no|analyzed}
where "analyzed" really means "yes". There is no analysis phase for
non-string fields, so not_analyzed vs analyzed is meaningless. We
either index the value or we don't.
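As a rough sketch of the rules above (in Python, just to make them concrete): the helper `allowed_index_options` and the field names `title`, `tag`, `views` and `bad` are made up for illustration, and the check only mirrors the description in this message, not whatever validation Elasticsearch itself performs.

```python
# Option sets as described above: strings get three choices,
# other scalar types effectively get a yes/no switch.
STRING_INDEX_OPTIONS = {"no", "not_analyzed", "analyzed"}
SCALAR_INDEX_OPTIONS = {"no", "analyzed"}


def allowed_index_options(field_type):
    """Return the "index" values described above for a field type (hypothetical helper)."""
    if field_type == "string":
        return STRING_INDEX_OPTIONS
    return SCALAR_INDEX_OPTIONS


# Hypothetical field definitions, written as plain dicts.
fields = {
    "title": {"type": "string", "index": "analyzed"},
    "tag": {"type": "string", "index": "not_analyzed"},
    "views": {"type": "integer", "index": "no"},
    "bad": {"type": "date", "index": "not_analyzed"},
}

# Collect the fields whose "index" value falls outside the described set.
unexpected = sorted(
    name
    for name, spec in fields.items()
    if spec["index"] not in allowed_index_options(spec["type"])
)
print(unexpected)
```

Here only `bad` is flagged, because per the description above "not_analyzed" carries no extra meaning for a date field.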
Objects (type: "object" or type: "nested") are different. An object
like:
{ foo: { bar: "text"}}
is flattened to something like { "foo.bar": "text" }
There IS no "foo" field in the Lucene index. So the "index" parameter
has no meaning at this level.
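To illustrate the flattening idea, here is a minimal Python sketch. The `flatten` function is hypothetical and is not how Elasticsearch implements it internally; it ignores arrays and nested-type handling and only shows why no field called "foo" ends up in the result.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted paths (illustrative only)."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Objects contribute no field of their own, only their leaves.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


flattened = flatten({"foo": {"bar": "text"}})
print(flattened)  # {'foo.bar': 'text'}
```

The only key in the output is "foo.bar", which is why an "index" setting on "foo" itself would have nothing to apply to.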
By setting { enabled: false} at the object level, you are saying: "don't
process anything below this point". This is a good way of storing any
data structure in your object, without indexing any of it. If the data
structure changes completely, you won't get field-type errors, because
no fields are being indexed.
Consider, for example, storing session data. Session data could consist
of anything. We don't want it to be searchable, we just want to store
it. So set up the "session" type as:
{ "session": {
    "properties": {
      "data": {
        "type": "object",
        "enabled": false
      },
      "date": { "type": "date" }
    }
  }
}
_source is an optimized field stored in Lucene that ES manages for
you. It's very efficient to store and retrieve. I hear you that it
seems intuitive it would be slower to deserialize a field full of all
the fields' data rather than a single field with just what you want,
but the difference is so small you will probably feel the pain
somewhere else before you ever see it there (namely IO).
For 99.9% of cases, _source is performant enough that its convenience
outweighs selectively storing. It's compressed in a binary format
that is really fast and really small. It also lets you reindex data
easily. We recommend you use it until you have a
measurable need to not use it. Often it's a premature optimization
to not use it.
For each stored field that you retrieve you pay a penalty of up to 5ms.
Decompressing and parsing the _source field is generally much faster
than this. To give you an idea of how CPU and disk costs compare, look at
these numbers from Google:
L1 cache reference                             0.5 ns
Branch mispredict                                5 ns
L2 cache reference                               7 ns
Mutex lock/unlock                               25 ns
Main memory reference                          100 ns
Compress 1K bytes with Zippy                 3,000 ns
Send 2K bytes over 1 Gbps network           20,000 ns
Read 1 MB sequentially from memory         250,000 ns
Round trip within same datacenter          500,000 ns
Disk seek                               10,000,000 ns
Read 1 MB sequentially from disk        20,000,000 ns
Send packet CA->Netherlands->CA        150,000,000 ns
So setting fields to stored seldom makes sense. Just use the _source
field, unless you can demonstrate that, for your particular use case,
storing a field separately is more efficient.
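To put the table in perspective, a quick back-of-the-envelope calculation in Python. The figures are copied from the list above; the dictionary keys and the `ratio` helper are just names chosen for this sketch. Note that the table gives a compression figure, so using it as a stand-in for _source decompression is an assumption.

```python
# Latency figures from the table above, in nanoseconds.
LATENCY_NS = {
    "compress_1k_zippy": 3_000,
    "main_memory_reference": 100,
    "disk_seek": 10_000_000,
}


def ratio(slow, fast):
    """How many times the 'fast' operation fits into one 'slow' operation."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


seek_vs_compress = ratio("disk_seek", "compress_1k_zippy")
seek_vs_memory = ratio("disk_seek", "main_memory_reference")

print(f"1K compressions per disk seek: {seek_vs_compress:.0f}")
print(f"memory references per disk seek: {seek_vs_memory:.0f}")
```

By these numbers a single disk seek costs about as much as a few thousand 1K compressions, which is the intuition behind preferring _source over extra stored-field lookups.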
clint