What's the difference between "enabled," "index," and "store"?

I can't wrap my head around the difference between the field attributes
"enabled," "index," and "store." How is setting a field to { "enabled" :
false }, different than setting it to { "index" : "no" }? What part does
"store" play in all this? Is { "store" : "yes" } the same as { "index" :
"not_analyzed" }?

I generally use ES to retrieve the full _source for an indexed document,
but only need to search a few fields. So, to maintain tight control over
index size and speed, I create my mappings with everything turned off and
then turn things on as needed.

What will be the difference between the way I've mapped the following
properties of the company type?

{
  "company" : {
    "type" : "object",
    "include_in_all" : false,
    "index" : "no",
    "enabled" : false,
    "path" : "full",
    "dynamic" : true,
    "store" : "no",
    "properties" : {
      "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "enabled" : true },
      "description" : { "type" : "string", "index" : "analyzed" },
      "free_text" : { "type" : "string", "enabled" : true },
      "more_text" : { "type" : "string", "store" : "yes" }
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The enabled attribute only applies to various ElasticSearch
specific/created fields such as _index and _size. User-supplied fields do
not have an "enabled" attribute.

Store and index are different. Store means the data is stored by Lucene,
which will return it if asked. Stored fields are not necessarily
searchable. By default, fields are not stored, but the full source is.
Since you want the defaults (which makes sense), simply do not set the
store attribute.

The index attribute is used for searching. Only indexed fields can be
searched. The reason for the differentiation is that indexed fields are
transformed during analysis, so you cannot retrieve the original data if it
is required. You do not want the default of "analyzed" for all fields, so
you need to disable it for the ones you do not want to search on
(free_text, more_text).
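
As a rough illustration of the advice above, here is a minimal sketch of what the "company" properties might look like with "store" left unset and "index" turned off for the two fields that are not searched. It is written as a Python dict so it can be printed as JSON; the helper searchable_fields is hypothetical and only inspects the dict, it does not talk to ES.

```python
import json

# Sketch of the "company" properties using the string-field syntax from this
# thread. "store" is deliberately left unset, so only _source holds the data.
company_mapping = {
    "company": {
        "properties": {
            "name": {"type": "string", "index": "analyzed"},
            "description": {"type": "string", "index": "analyzed"},
            "free_text": {"type": "string", "index": "no"},
            "more_text": {"type": "string", "index": "no"},
        }
    }
}


def searchable_fields(mapping, type_name):
    """Hypothetical helper: names of fields whose "index" setting is not "no"."""
    props = mapping[type_name]["properties"]
    return sorted(
        name for name, spec in props.items()
        if spec.get("index", "analyzed") != "no"
    )


print(json.dumps(company_mapping, indent=2))
print(searchable_fields(company_mapping, "company"))
```

All four fields would still come back in _source; only name and description would be searchable.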

Many of the attributes you set for the "company" document type are not
valid document type attributes. The items under the right-hand-side Fields
section listed here, http://www.elasticsearch.org/guide/reference/mapping/,
are valid for document types.

Cheers,

Ivan

On Tue, Feb 12, 2013 at 2:35 PM, Brian Jones tbrianjones@gmail.com wrote:

I can't wrap my head around the difference between the field attributes
"enabled," "index," and "store." How is setting a field to { "enabled" :
false }, different than setting it to { "index" : "no" }? What part does
"store" play in all this? Is { "store" : "yes" } the same as { "index" :
"not_analyzed" }?

If _index is enabled, each doc will store the name of the index it
belongs to in a field in the doc.

"index" in the mapping refers to the analysis of the field data
during indexing operation. It determines how text is tokenized and
filtered.

"store" allows a field to be stored in the Lucene document as its own
Field. This is traditionally how you would store data in a
Lucene-based search engine.

_source is an optimized field stored in Lucene that ES manages for
you. It's very efficient to store and retrieve. I hear you that it
seems intuitive it would be slower to deserialize a field full of all
the fields' data rather than a single field with just what you want,
but the difference is so small you will probably feel the pain
somewhere else before you ever see it there (namely IO).

For 99.9% of cases, _source is performant enough that its convenience
outweighs selectively storing. It's compressed in a binary format
that is really fast and really small. It also enables you to be able
to reindex data easily. We recommend you use it until you have a
measurable need to not use it. Often it's a premature optimization
to not use it.

-Drew

To add to what Drew has said:

{enabled: false} is different from { index: no}.

String fields accept: { index: no|not_analyzed|analyzed}:

  • no: don't index the string
  • not_analyzed: index the string exactly as passed in
  • analyzed: first analyze the string, then index the resulting
    tokens

Other scalar values (e.g. number, date) accept: {index: no|analyzed},
where "analyzed" really means "yes". There is no analysis phase for
non-string fields, so not_analyzed vs analyzed is meaningless. We
either index the value or we don't.

Objects (type: "object" or type: "nested") are different. An object
like:

{ foo: { bar: "text"}}

is flattened to something like { "foo.bar": "text" }

There IS no "foo" field in the Lucene index. So the "index" parameter
has no meaning at this level.
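
To make the flattening concrete, here is a toy Python sketch. The flatten function is a hypothetical helper that mimics the idea for plain objects; it is not how ES or Lucene implement it.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # An inner object contributes no field of its own,
            # only a path prefix for its leaf values.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


print(flatten({"foo": {"bar": "text"}}))  # {'foo.bar': 'text'}
```

Note that "foo" never appears as a key in the result, which is why an "index" setting on it would have nothing to act on.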

By setting { enabled: false} at the object level, you are saying: "don't
process anything below this point". This is a good way of storing any
data structure in your object, without indexing any of it. If the data
structure changes completely, you won't get field-type errors, because
no fields are being indexed.

Consider, for example, storing session data. Session data could consist
of anything. We don't want it to be searchable, we just want to store
it. So set up the "session" type as:

{
  "session": {
    "properties": {
      "data": {
        "type": "object",
        "enabled": false
      },
      "date": { "type": "date" }
    }
  }
}
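
The effect of { enabled: false } on the session example can be modelled with a small Python sketch. This is a toy model under simplifying assumptions: indexed_paths is a hypothetical helper and the sample document is made up. It only shows which dotted paths would be considered for indexing.

```python
def indexed_paths(properties, doc, prefix=""):
    """List the dotted field paths this toy model would index."""
    paths = []
    for key, value in doc.items():
        spec = properties.get(key, {})
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            if spec.get("enabled", True) is False:
                continue  # enabled: false -> nothing below this point is processed
            paths.extend(indexed_paths(spec.get("properties", {}), value, path))
        elif spec.get("index", "analyzed") != "no":
            paths.append(path)
    return paths


session_properties = {
    "data": {"type": "object", "enabled": False},
    "date": {"type": "date"},
}

# Made-up document: "data" can hold any structure at all.
doc = {"data": {"cart": {"items": 3}, "user": "abc"}, "date": "2013-02-13"}
print(indexed_paths(session_properties, doc))  # ['date']
```

Whatever shape "data" takes, only "date" is ever looked at, so a change in the structure of "data" cannot cause a field-type conflict.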

For each stored field that you retrieve you pay a penalty of up to 5ms.
Decompressing and parsing the _source field is generally much faster
than this. To give you an idea of how cpu vs disk compare, look at
these numbers from Google:

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns

So setting fields to stored seldom makes sense. Just use the _source
field, unless you can demonstrate that, for your particular use case,
storing a field separately is more efficient.
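
To put the table in perspective, the ratios can be worked out directly. This is plain arithmetic on the figures quoted above; the constant names and the ratio helper are only for illustration.

```python
# Latency figures in nanoseconds, copied from the table above.
DISK_SEEK_NS = 10_000_000
COMPRESS_1K_ZIPPY_NS = 3_000
READ_1MB_MEMORY_NS = 250_000


def ratio(slow_ns, fast_ns):
    """How many times longer the first operation takes than the second."""
    return slow_ns / fast_ns


# One disk seek costs about as much as compressing 1K with Zippy ~3,333 times.
print(round(ratio(DISK_SEEK_NS, COMPRESS_1K_ZIPPY_NS)))  # 3333

# One disk seek costs as much as reading 1 MB from memory 40 times.
print(ratio(DISK_SEEK_NS, READ_1MB_MEMORY_NS))  # 40.0
```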

clint

Just to add the source of the table Clinton quoted, it's from

Jeff Dean, "Designs, Lessons and Advice from Building Large Distributed
Systems",

Best regards,

Jörg

Great answers. Thanks.

A few more clarifying questions as my mappings are still producing some
confusing ES behavior:

  1. In cases where I'll be pushing a lot of data into an object that I
    don't want indexed, will the mapping below "turn all indexing off" at the
    root "company" level? Can I then simply specify mappings for data that I
    want to index / search by setting index to analyzed / not_analyzed, setting
    enabled to true for objects, and include_in_all to true?

  2. In the mapping below, do I need to set enabled to true for objects
    within the company object if I want to index their properties, or can I
    simply set their properties to index? See the company.contacts object
    below.

  3. In the mapping below, what is the deal with nested object types? When
    I set enabled to false for the company object, setting enabled to true for
    the nested objects does not seem to enable them. Their document is still
    missing from the index, and they are not searchable. See the
    company.business_types nested object below.

{
  "company" : {
    "type" : "object",
    "enabled" : false,
    "include_in_all" : false,
    "path" : "full",
    "dynamic" : "strict",
    "properties" : {
      "name" : {
        "type" : "multi_field",
        "fields" : {
          "name" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 5.0 },
          "not_analyzed" : { "type" : "string", "index" : "not_analyzed" }
        }
      },
      "description" : { "type" : "string", "index" : "analyzed", "boost" : 3.0 },
      "cage_code" : { "type" : "string", "index" : "not_analyzed" },
      "logo_file_url" : { "type" : "string" },
      "contacts" : {
        "type" : "object",
        "enabled" : true,
        "properties" : {
          "title" : { "type" : "string", "index" : "not_analyzed" },
          "description" : { "type" : "string", "index" : "analyzed" },
          "is_primary" : { "type" : "boolean" },
          "score" : { "type" : "integer" },
          "verified" : { "type" : "boolean" },
          "name" : { "type" : "string" },
          "email" : { "type" : "string" },
          "phone" : { "type" : "string" },
          "address" : { "type" : "string" },
          "geolocation" : { "type" : "geo_point", "lat_lon" : true }
        }
      },
      "business_types" : {
        "type" : "nested",
        "enabled" : true,
        "include_in_root" : true,
        "properties" : {
          "title" : {
            "type" : "multi_field",
            "fields" : {
              "title" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 5.0 },
              "not_analyzed" : { "type" : "string", "index" : "not_analyzed" }
            }
          },
          "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true },
          "score" : { "type" : "integer", "index" : "analyzed" },
          "verified" : { "type" : "boolean", "index" : "analyzed" },
          "certified" : { "type" : "boolean", "index" : "analyzed" }
        }
      }
    }
  }
}

Hiya

A few more clarifying questions as my mappings are still producing
some confusing ES behavior:

  1. In cases where I'll be pushing a lot of data into an object that
    I don't want indexed, will the mapping below "turn all indexing off"
    at the root "company" level? Can I then simply specify mappings for
    data that I want to index / search by setting index to analyzed /
    not_analyzed, setting enabled to true for objects, and include_in_all
    to true?

Setting {enabled: false} at the root level means that it will not look
at any fields in that document. Period. You'd need to have the root
enabled, then disable objects (enabled:false) or fields (index:no)
within the company object.
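
A minimal sketch of that arrangement, written as a Python dict so it can be dumped to JSON. The field names come from the mapping in the question except "raw_profile", which is a hypothetical free-form object added purely to show an object-level { enabled: false }.

```python
import json

# Sketch only: the root stays enabled, and indexing is switched off
# per field ("index": "no") or per object ("enabled": false).
company_mapping = {
    "company": {
        # No "enabled": false at this level, so the fields below are processed.
        "properties": {
            "name": {"type": "string", "index": "analyzed"},
            "cage_code": {"type": "string", "index": "not_analyzed"},
            "logo_file_url": {"type": "string", "index": "no"},
            # Hypothetical object holding arbitrary data that is never searched.
            "raw_profile": {"type": "object", "enabled": False},
        }
    }
}

print(json.dumps(company_mapping, indent=2))
```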

clint

I think the feature I've been looking for is Dynamic Templates found at the
bottom of the page here:
http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html

My use case is that I'm pushing fairly large _source documents into ES (to
avoid extra calls to another db after search), I only need to search on a
few fields, and the _source doc will be changing frequently. I'd rather
not have to update my mappings with new fields to "not index" if I can
avoid it, and it seems like Dynamic Templates will let me do this.
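
For anyone landing here later, a rough sketch of what such a mapping might look like, built as a Python dict and printed as JSON. The template name "strings_not_indexed" is made up, and the template only covers new string fields; other types would need their own templates. Treat it as a starting point to check against the Dynamic Templates docs linked above, not a tested mapping.

```python
import json

# Sketch: explicitly mapped fields are searchable, while any new string field
# that shows up in _source later is mapped with "index": "no".
company_mapping = {
    "company": {
        "dynamic_templates": [
            {
                "strings_not_indexed": {  # hypothetical template name
                    "match": "*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "no"},
                }
            }
        ],
        "properties": {
            "name": {"type": "string", "index": "analyzed"},
            "cage_code": {"type": "string", "index": "not_analyzed"},
        },
    }
}

print(json.dumps(company_mapping, indent=2))
```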
