DocValues in ES 1.0.0.Beta1


(Anantha Govindarajan) #1

Hi,

Can anyone help me with how to use doc_values in Beta1, and how to verify
that they are in use? I have the following settings in the elasticsearch
config directory.

in elasticsearch.yml

index.mapping.ignore_malformed: false
index.store.type: mmapfs
index.codec.doc_values_format.my_format.type: disk

in templates section

{
  "app_template" : {
    "template" : "2",
    "order" : 0,
    "settings" : {
      "index.number_of_shards" : 1,
      "index.number_of_replicas" : 1
    },
    "mappings" : {
      "2" : {
        "properties" : {
          "time_stamp" : {"type" : "long", "precision_step" : 0, "doc_values_format" : "my_format"},
          "message" : {"type" : "string", "omit_norms" : true, "index" : "analyzed"}
        }
      }
    }
  }
}

I created an index with the above configuration and indexed a few
documents. While searching, I attached a debugger and found that doc values
are enabled by default for the "_version" field, while the "time_stamp"
field has a null value for docValueType. Looking in the index directory, the
_0.cfe file contains only one set of doc values files (both _Lucene45_0.dvd
and _Lucene45_0.dvm, for the _version field if I understand correctly).

Then I removed index.codec.doc_values_format.my_format.type: disk from
elasticsearch.yml and modified app_template as below:

"time_stamp" : {"type" : "long", "precision_step": 0, "doc_values_format" : "disk"}

but the result was still the same.

I indexed the same amount of data in ES 0.90.5 (without doc values) and
issued a query. The field data stats API shows the same amount of memory for
both ES 0.90.5 and ES 1.0.0.Beta1 (with doc values).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Anantha Govindarajan) #2

DocValues are working nicely, thanks to this blog post.

http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/



(Ivan Brusic) #3

How would you define "working nicely"? Greater QPS? Reduced GC? What is
your metric? :)

I am curious about DocValues in storing norms as described by Simon almost
two years ago:
http://blog.trifork.com/2012/01/19/simon-says-single-byte-norms-are-dead/
The loss of precision affects us somewhat since we take search relevancy
seriously.

Cheers,

Ivan



(Anantha Govindarajan) #4

Hi Ivan, I meant that the doc values files (.dvd and .dvm) are now being
created in the index.

Initially I tried the following:

"time_stamp" : {"type" : "long", "precision_step": 0, "doc_values_format" : "my_format"}

Once I came across this post, I learned that the doc_values_format setting
alone is not enough to enable doc values; the field data format also needs
to be changed, like "fielddata" : { "format": "doc_values"}. Now my template
looks like the following:

"time_stamp" : {"type" : "long", "precision_step": 0, "fielddata" : {"format": "doc_values"}, "doc_values_format" : "disk"}

Ivan, thanks for asking this question. We have yet to collect those kinds of
metrics; I will share them once I have them.
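
The mapping change described above can be sketched as follows. This is only a minimal illustration in Python, assuming the 1.0.0.Beta1 mapping options quoted in this thread; the helper names (`doc_values_field`, `uses_doc_values`) are hypothetical, and nothing here talks to a cluster.

```python
import json


def doc_values_field(field_type, doc_values_format="default", **extra):
    """Build a field mapping that reads field data from doc values.

    Hypothetical helper: the keys mirror the 1.0.0.Beta1 mapping options
    quoted in this thread, nothing more.
    """
    mapping = {"type": field_type}
    mapping.update(extra)
    # Setting doc_values_format alone did not switch field data to doc values;
    # the fielddata format has to be set to "doc_values" as well.
    mapping["fielddata"] = {"format": "doc_values"}
    mapping["doc_values_format"] = doc_values_format
    return mapping


def uses_doc_values(field_mapping):
    """Return True if the mapping asks for doc-values-backed field data."""
    return field_mapping.get("fielddata", {}).get("format") == "doc_values"


# The first attempt from this thread: doc_values_format only.
first_attempt = {"type": "long", "precision_step": 0, "doc_values_format": "my_format"}
# The mapping that produced .dvd/.dvm files for time_stamp.
working = doc_values_field("long", doc_values_format="disk", precision_step=0)

print(uses_doc_values(first_attempt))  # False
print(uses_doc_values(working))        # True
print(json.dumps({"time_stamp": working}, indent=2))
```

The check only inspects the mapping; whether doc values are actually written still has to be confirmed on disk, for example by looking for the .dvd and .dvm files as described above.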



(Adrien Grand) #5

Hi Anantha,

On Fri, Dec 6, 2013 at 8:54 AM, Anantha Govindarajan <
ananthagovindarajan@gmail.com> wrote:

"time_stamp" : {"type" : "long", "precision_step": 0 , "fielddata" : {
"format": "doc_values"}, "doc_values_format" : "disk"}.

Ivan , Thanks for asking this question , we yet to collect those kind of
metrics , i will share once i achieve those.

I see that you are using the "disk" doc-values format. This format tries
to make the memory usage as low as possible, at some performance cost. My
opinion is that the "default" doc values format is usually a better
trade-off, since it only loads into memory the data-structures which are
the most important performance-wise (which happen to be rather small).

--
Adrien Grand



(Anantha Govindarajan) #6

Hi Adrien, thanks for your suggestion. We are already facing memory issues
on our ES data nodes. Sometimes GC collections run for more than 2 minutes.
If I understand correctly, the GC pause makes the node unavailable to the
cluster, i.e. the master marks it as down because the node does not respond
to any of the 3 retries at the default 30-second interval (one and a half
minutes in total). That's why we chose the disk doc values format. Anyhow,
let me check with the default doc values format as well.
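
The timing argument above can be written out as a small calculation. This is a simplified model, assuming (as the post does) 3 retries that each wait 30 seconds; the function and parameter names are hypothetical and are not Elasticsearch settings.

```python
def seconds_until_marked_down(ping_timeout_s=30, ping_retries=3):
    """Worst-case wait before the master gives up on a node.

    Simplified model taken from the post: every retry waits one full timeout.
    """
    return ping_timeout_s * ping_retries


def gc_pause_drops_node(gc_pause_s, ping_timeout_s=30, ping_retries=3):
    """Return True if a stop-the-world pause outlasts all retries."""
    return gc_pause_s > seconds_until_marked_down(ping_timeout_s, ping_retries)


print(seconds_until_marked_down())  # 90
print(gc_pause_drops_node(120))     # True: a 2-minute pause outlasts 90 s
print(gc_pause_drops_node(60))      # False
```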



(Otis Gospodnetić) #7

In our recent experiment (with Solr, not ES, but that doesn't matter here
because DocValues are Lucene goodness, not Solr or ES goodness) we saw a
much lower footprint with DocValues as we've used them.

Check slides 26-28 in


28 has a side by side heap comparison.

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



(Anantha Govindarajan) #8

Hi Adrien ,

"My opinion is that the "default" doc values format is usually a better
trade-off, since it only loads into memory the data-structures which are
the most important performance-wise (which happen to be rather small). "

What are the important data structures for doc_values? I am seeing
improvements in response speed when using the default doc values format
instead of disk.

I have enabled the default doc values format for a long field (time_stamp).
What is the difference between how the disk and default doc values formats
are loaded?

Ananth



(Adrien Grand) #9

Hi,


It is expected that the default format performs better than the disk
format. The structures I'm talking about depend on the data type, but for
a long field (or numeric field in general), there are two structures: the
bytes of the numeric data (which are stored sequentially) and an index that
maps doc IDs to the offset where the numeric bytes are stored for that
particular document. For both "default" and "disk", the numeric bytes are
stored on disk; however, "default" keeps the index in memory while
"disk" keeps it on disk. Please note that this index is efficiently
compressed, and should take very little memory, especially if all your
documents have the same number of values.
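
The two structures described above can be pictured with a toy model. This is a sketch only, assuming 8-byte big-endian longs and an uncompressed offset list; it is not Lucene's actual file format, and the class name `ToyNumericDocValues` is hypothetical.

```python
import struct


class ToyNumericDocValues:
    """Toy model: sequential value bytes plus a doc ID -> offset index."""

    def __init__(self, values_per_doc):
        self.offsets = []  # the "index": where each document's bytes start
        chunks = []
        position = 0
        for values in values_per_doc:
            self.offsets.append(position)
            chunk = struct.pack(f">{len(values)}q", *values)
            chunks.append(chunk)
            position += len(chunk)
        self.offsets.append(position)  # sentinel: end of the last document
        self.data = b"".join(chunks)   # the "numeric bytes", stored sequentially

    def get(self, doc_id):
        """Look up the offsets for doc_id, then decode that byte range."""
        start, end = self.offsets[doc_id], self.offsets[doc_id + 1]
        count = (end - start) // 8
        return list(struct.unpack(f">{count}q", self.data[start:end]))

    def has_fixed_width(self):
        """True if every document occupies the same number of bytes."""
        widths = {b - a for a, b in zip(self.offsets, self.offsets[1:])}
        return len(widths) <= 1


dv = ToyNumericDocValues([[10], [20, 30], []])
print(dv.offsets)            # [0, 8, 24, 24]
print(dv.get(1))             # [20, 30]
print(dv.has_fixed_width())  # False

timestamps = ToyNumericDocValues([[1386143581], [1386143582], [1386143583]])
print(timestamps.has_fixed_width())  # True
```

When every document has the same number of values, each offset is just the doc ID times a constant, so the offset list carries almost no information. That gives some intuition for why the index can stay very small in that case.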

--
Adrien Grand


