Indexing performance with doc values (particularly with a larger number of fields)

This might be more of a Lucene question, but a quick Google search didn't
turn up anything.

Has anyone done/seen any benchmarking on indexing performance (overhead)
due to using doc values?

I often index quite large JSON objects with many fields (e.g. 50). I'm
trying to get a feel for whether I can just let all of them be doc values
on the off chance I'll want to aggregate over them, or whether I need to
pick beforehand which fields will support aggregation.
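
For concreteness, "letting all of them be doc values" could look roughly like
the sketch below: a dynamic template per field type, so that every newly seen
field gets doc values without being listed by hand. This is only a sketch
under assumptions. The helper name doc_values_mapping and the type name "doc"
are made up for illustration, and the "doc_values" mapping key is assumed
here; the exact setting name and placement depend on the Elasticsearch
version, so check the mapping docs for yours.

```python
import json


def doc_values_mapping(type_name="doc", enable=True):
    """Build an index-creation body whose dynamic templates switch doc values
    on (or off) for dynamically mapped string, long and double fields.

    Assumption: the "doc_values" key is accepted in field mappings by the
    Elasticsearch version in use; older releases may spell this differently.
    """
    def template(name, match_type, mapping):
        # Copy the base mapping and add the doc values flag to it.
        mapping = dict(mapping, doc_values=enable)
        return {name: {"match_mapping_type": match_type, "mapping": mapping}}

    templates = [
        # Strings are mapped not_analyzed here, on the assumption that
        # doc values are wanted on whole, untokenized values.
        template("strings", "string", {"type": "string", "index": "not_analyzed"}),
        template("longs", "long", {"type": "long"}),
        template("doubles", "double", {"type": "double"}),
    ]
    return {"mappings": {type_name: {"dynamic_templates": templates}}}


if __name__ == "__main__":
    # Print the body that would be sent when creating the index.
    print(json.dumps(doc_values_mapping(), indent=2))
```

Calling doc_values_mapping(enable=False) gives the matching "legacy" mapping,
which is handy for comparing the two configurations side by side.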

(A related question: presumably allowing a mix of doc values fields and
"legacy" fields is a bad idea, because if you use doc values fields you
want a low max heap so that the file cache has lots of memory available,
whereas if you use the field cache you need a large heap - is that about
right, or am I missing something?)

Thanks for any insight!

Alex
Ikanow

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0361eda4-ab39-4536-b91a-ccb710921edd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Would be a nice benchmark to run (all the better if you find hotspots or
slow spots to go improve in Lucene)!

The data structures for docvalues are less complex than the data
structures for the inverted index.

In the past I've enabled docvalues for many fields, as you suggest, and
in my tests the time for e.g. segment merging was still dominated by
the inverted index (terms dict, postings lists, etc.), since I had all
the fields indexed for search, too. But nothing is free: some of this
is data-dependent, so you have to test.
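
A rough way to run that kind of test is sketched below: generate synthetic
50-field documents, serialize them as bulk request bodies, and time how long a
send function takes to push them into an index, once with docvalues enabled
and once without. Everything here is an assumption for illustration. The names
make_doc, bulk_body and time_indexing are hypothetical, send is whatever
callable posts a body to your cluster's bulk endpoint, and the action line
with "_index" and "_type" assumes the bulk format of Elasticsearch versions
that still have mapping types. Random data is not representative of real
documents, and request time alone will not capture all background merging.

```python
import json
import random
import string
import time


def make_doc(n_fields=50, rng=random):
    """Return one synthetic document: even-numbered fields hold short random
    strings, odd-numbered fields hold random integers."""
    doc = {}
    for i in range(n_fields):
        if i % 2 == 0:
            doc["s%d" % i] = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(8))
        else:
            doc["n%d" % i] = rng.randint(0, 1000000)
    return doc


def bulk_body(index, type_name, docs):
    """Serialize docs as newline-delimited JSON: an action line followed by
    the document source, with a trailing newline at the end."""
    action = json.dumps({"index": {"_index": index, "_type": type_name}})
    lines = []
    for doc in docs:
        lines.append(action)
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def time_indexing(send, index, n_docs=1000, batch=100, n_fields=50, seed=0):
    """Time send() over pre-built bulk bodies; returns elapsed seconds.

    Bodies are generated before the clock starts so that document
    generation is not counted as indexing time.
    """
    rng = random.Random(seed)  # fixed seed: both runs see identical data
    bodies = []
    for start in range(0, n_docs, batch):
        size = min(batch, n_docs - start)
        docs = [make_doc(n_fields, rng) for _ in range(size)]
        bodies.append(bulk_body(index, "doc", docs))
    t0 = time.perf_counter()
    for body in bodies:
        send(body)
    return time.perf_counter() - t0
```

Run time_indexing once against an index created with docvalues and once
against one without, using the same seed, and compare the two timings (ideally
over several repetitions and with your own documents instead of random ones).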

About the heap, you are right: it's probably best to adjust your heap
accordingly if you are using docvalues.

On Sun, Mar 23, 2014 at 10:01 PM, Alex at Ikanow <apiggott@ikanow.com> wrote:
