Stored Fields (by default)?


(davrob) #1

Hi,

For performance reasons (both server side and browser side) I'm
limiting the number of fields returned by using
SearchRequestBuilder.addFields(), and then reading the values from each
hit using SearchHit.field().

The strange thing is that most of my fields are being returned even
though I haven't marked them as "stored":true in my explicit mapping.

Are some fields stored by default? The docs seem to suggest not:
http://www.elasticsearch.org/guide/reference/mapping/core-types.html

Currently, I'm going through my list of fields one by one, seeing whether
each gets returned by SearchHit.field(), and will try putting
"stored":true explicitly on those that don't, to see if they suddenly get
returned by the call.

Best Regards,

David.


(Benjamin Devèze) #2

Do you have _source enabled (it is enabled by default)? Fields content can
be retrieved from _source even if the field itself is not stored.
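
To make that concrete, here is a minimal sketch of the lookup order described above. It is a toy model in plain Java, not Elasticsearch's actual implementation: the class FieldLookup, the method fetch, and the example field names are all hypothetical, and the stored fields and the parsed _source are simply represented as maps.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not Elasticsearch code) of fetching fields for one hit. */
final class FieldLookup {

    /**
     * Returns the requested fields, preferring a separately stored value and
     * falling back to the parsed _source when the field was not stored.
     * Fields found in neither place are left out of the result.
     */
    static Map<String, Object> fetch(Map<String, Object> storedFields,
                                     Map<String, Object> source,
                                     List<String> requested) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : requested) {
            if (storedFields.containsKey(name)) {
                // The field was explicitly stored: use that copy.
                out.put(name, storedFields.get(name));
            } else if (source != null && source.containsKey(name)) {
                // Not stored, but _source is enabled: extract it from there.
                out.put(name, source.get(name));
            }
        }
        return out;
    }
}
```

With no stored fields at all and _source enabled, every requested field that exists in the document still comes back, which would match what David is seeing.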


(davrob) #3

Hi Benjamin,

Yes, I do have _source enabled by default, but for performance reasons I
don't want to fetch the source and parse the fields out of it.

-David.



(Clinton Gormley) #4

Hi David

> Yes, I do have _source enabled by default, but for performance reasons I
> don't want to fetch the source and parse the fields out of it.

This is usually a false economy. Lucene needs to do a disk seek for each
field that it returns, as opposed to just one for the _source field.

Usually the only time it makes sense to use separately stored fields
instead of the _source is when you have very large docs and you only
want (e.g.) a last_modified date out of your doc.

clint


(davrob) #5

Hi Clint,

That's interesting. I had assumed from this post
http://elasticsearch-users.115913.n3.nabble.com/profiling-shows-large-cpu-usage-for-scripts-tp2937899p2939235.html
that if I needed about 20 columns out of the 70 or so I have
indexed, I should use the field values. Certainly, when I only need to
return 5 or so fields, I find the query time is reduced by about
half.

In this case, with 20 or so fields, I'm in two minds: whether to get the
_source and return it more or less as it is to the browser, or
whether to parse the source into a smaller dataset, giving the browser
less to parse and reducing network time.
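
The second option could look roughly like the sketch below. It assumes the hit's _source is already available as a Map<String, Object>; the class SourceProjection, the method project, and the field names in the usage note are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: trims a parsed _source down to the fields the browser needs. */
final class SourceProjection {

    /**
     * Copies only the wanted top-level fields into a new map, keeping the
     * order in which they appear in the source. The source map is not modified.
     */
    static Map<String, Object> project(Map<String, Object> source, Set<String> wanted) {
        Map<String, Object> slim = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                slim.put(entry.getKey(), entry.getValue());
            }
        }
        return slim;
    }
}
```

Calling SourceProjection.project(source, Set.of("title", "last_modified")) on a 70-field document would return a two-entry map, which can then be serialised and sent to the browser. This only handles top-level fields; nested objects are copied whole.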

Thanks for your insight about the disk seek.

-David.



(Shay Banon) #6

From the other thread: the fact that something takes 30% of the CPU out of the
execution time does not tell you whether it's slow or not :). 30% of
20ms is not that much, for example ;).
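
As a quick illustration of that arithmetic, here is a tiny sketch. The class ProfileShare and the method absoluteMillis are hypothetical names, and the 2000ms figure in the note is an invented contrast case, not a measurement from either thread.

```java
/** Hypothetical helper: converts a profiler percentage into absolute time. */
final class ProfileShare {

    /** Returns how many milliseconds a given percentage of the total represents. */
    static double absoluteMillis(double totalMillis, int percent) {
        return totalMillis * percent / 100.0;
    }
}
```

ProfileShare.absoluteMillis(20.0, 30) works out to 6ms, while the same 30% share of a 2000ms request would be 600ms. The percentage alone does not say which of the two you are looking at.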

As Clinton said, most of the time it makes sense to just get the _source, with
the exception of very large single fields. Assuming you have no stored
fields, asking for specific fields will cause the source to be loaded and
parsed, and the requested fields to be extracted from it. On the other hand,
if you just ask for the _source, it will be loaded and returned as is all the
way back.


