fielddata/docvalues are just columns across your documents. So if you have a single-valued integer field in an index of 5 million docs, think of it conceptually as int[5_000_000]. A string field is also like an int[5_000_000], just populated with term IDs (ordinals) that can be used for sort/range/etc operations as-is... and on the side there is a dictionary to map them back to values. So let's ignore the difference for this discussion; integers are simpler to think about.
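A minimal sketch of that string-field layout (hypothetical class/field names, not the actual Lucene implementation): one ordinal per document, plus a sorted dictionary on the side, so comparisons work on the ordinals directly.

```java
import java.util.Arrays;
import java.util.TreeSet;

class OrdinalColumn {
    final String[] dictionary;  // sorted, deduplicated terms
    final int[] ordinals;       // one ordinal per document (docID -> ord)

    OrdinalColumn(String[] perDocValues) {
        // build the sorted dictionary of unique terms
        TreeSet<String> unique = new TreeSet<>(Arrays.asList(perDocValues));
        dictionary = unique.toArray(new String[0]);
        ordinals = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ordinals[doc] = Arrays.binarySearch(dictionary, perDocValues[doc]);
        }
    }

    // sort/range operations can compare ordinals as-is,
    // because the dictionary is sorted
    int compareDocs(int doc1, int doc2) {
        return Integer.compare(ordinals[doc1], ordinals[doc2]);
    }

    // only needed at the end, to map ordinals back to values
    String lookup(int doc) {
        return dictionary[ordinals[doc]];
    }
}
```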
If you have 5 million documents with a multi-valued integer field averaging 10 values per document, it's really like an int[50_000_000], plus another data structure on the side to find the start/end per doc (some docs might have 6 values, another 10, another only 2, and so on).
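Conceptually (hypothetical names again, not the real encoding), that side structure can be thought of as an offsets array over one big flat values array:

```java
class MultiValuedColumn {
    final long[] values;  // all documents' values, concatenated
    final int[] starts;   // starts[doc]..starts[doc+1] bounds doc's slice

    MultiValuedColumn(long[][] perDoc) {
        starts = new int[perDoc.length + 1];
        int total = 0;
        for (int doc = 0; doc < perDoc.length; doc++) {
            starts[doc] = total;
            total += perDoc[doc].length;
        }
        starts[perDoc.length] = total;  // total = docs * avg values per doc
        values = new long[total];
        for (int doc = 0; doc < perDoc.length; doc++) {
            System.arraycopy(perDoc[doc], 0, values, starts[doc], perDoc[doc].length);
        }
    }

    // the side structure answers "how many values does this doc have?"
    int count(int doc) {
        return starts[doc + 1] - starts[doc];
    }

    long valueAt(int doc, int i) {
        return values[starts[doc] + i];
    }
}
```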
So the multi-valued types really must be used carefully, otherwise you pay that 10x cost (docs * values). In a lot of cases they are still ok, for two reasons:
- the user adds each of these values explicitly and is aware they are doing this.
- the values within a doc are sorted, so accessing the min/max/median is O(1), which means operations like sorting still only have to deal with 5M values, not 50M.
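A small sketch of why keeping a document's values sorted makes min/max/median O(1): once the slice is sorted, the interesting values sit at fixed positions, so a sort-by-min over the whole index touches just one value per document.

```java
import java.util.Arrays;

class SortedSlice {
    final long[] values;  // one document's values, kept in sorted order

    SortedSlice(long[] vals) {
        values = vals.clone();
        Arrays.sort(values);  // done once at index time, not per query
    }

    long min()    { return values[0]; }                     // O(1)
    long max()    { return values[values.length - 1]; }     // O(1)
    long median() { return values[values.length / 2]; }     // O(1)
}
```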
On the other hand, if we were to just analyze the content automatically and populate values from the analysis chain, it's really trappy: the user is unaware of how many values are being added per document. The day we allow this is the day someone opens a bug complaining about how docvalues is horrible because they turned it on for an edge-ngram field. That's why the docvalues API makes you add each value explicitly rather than letting IndexWriter populate them from the analysis chain automatically.
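To illustrate the trap (a simplified stand-in for an edge-ngram filter, not Lucene's actual tokenizer): a single token silently fans out into roughly as many values as it has characters, so the per-document value count explodes without the user ever calling add.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeNgrams {
    // edge n-grams of one token, from minGram up to the full token length
    static List<String> edgeNgrams(String token, int minGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= token.length(); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }
}
```

One 6-character token already yields 6 values; a modest text field with dozens of tokens would multiply the column size far beyond one-value-per-doc.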
Also, there is little value in doing this, IMO. Let's say we allowed it: now you have a column-wide field with a bunch of unique terms (string values are sorted and deduplicated, so they lose both original order and frequency). What will you do with that? You can't get the most common terms without going back to some other data structure, like the term dictionary, to recover the original term frequencies, and then the whole thing blows up: not just 50 million things being processed, but probably more like hundreds of millions of random accesses to boot (50 million ordinal -> term lookups + 50 million term dictionary lookups + 50 million seeks to the postings lists + 50 million advance() calls to get to the doc ....)
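The four per-value round trips listed above can be tallied roughly (back-of-the-envelope arithmetic, using the hypothetical 5M docs x 10 values from earlier):

```java
class CostEstimate {
    static long randomAccesses(long docs, long avgValuesPerDoc) {
        long totalValues = docs * avgValuesPerDoc;
        // ordinal -> term lookup, term dictionary lookup,
        // postings-list seek, and advance() to reach the doc
        long roundTripsPerValue = 4;
        return totalValues * roundTripsPerValue;
    }
}
```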
So to summarize: fielddata might let you do it, but that does not mean it's a good idea for docvalues. It's been a goal to keep docvalues free of trappy behavior.
I don't know what you are trying to do, but this does not sound to me like the right data structure.