Geo Features & Doc_Values for Analyzed String Fields


(James Macdonald) #1

First, is there any place to look at the history of geo-spatial feature development in Elasticsearch? Something like a condensed changelog. I would like to know the past road map for geo features (and if possible what the next things on the road map are).

Second, I know there are no immediate plans to enable doc_values on tokenized string fields (only fields analyzed with the keyword analyzer). I am curious what the major challenge there is. My specific use case is using a terms aggregation across many, fairly short, text fields in order to get the most common words in that segment of data. But I wonder how implementing doc_values for the tokenized field would be different from implementing normal field data, and what the roadblocks are. I have tried looking for the relevant code in the Elastic GitHub repositories, but have not found it.

Thanks!


(Mark Walkom) #2
  1. Nope
  2. Not exactly, we're defaulting to anything that is not analysed. We currently don't support them on analysed values, but it's outside my current pay grade to be able to explain that adequately :stuck_out_tongue:

(James Macdonald) #3

Hi Mark,

Thanks for the reply. I've been digging into the ValuesSource and ValuesSourceAggregatorFactory code and the corresponding DocValues code from Lucene. That stuff is very hard to read, but fairly interesting. I have not found any reason yet why doc_values cannot be used on analyzed string fields (besides the fact that it is not supported).

I know there must be a good reason why it is not supported, since there have been several GitHub issues on this topic, most recently https://github.com/elastic/elasticsearch/issues/10061, and according to Clinton Gormley you were discussing the change internally.

If possible, I would love to hear from someone who would be able to explain the reason it isn't supported, if not the roadblocks to implementation in detail. It sounds like there may be major hurdles, but if it is simply an issue of priorities I may be able to help with the implementation.

Thanks!


#4

fielddata/docvalues are just columns across your documents. so if you have a single-valued integer field for an index of 5 million docs, think of it conceptually as int[5_000_000]. a string field is also like an int[5_000_000], just populated with term IDs (ordinals) that can be used to do sort/range/etc operations as-is... and at the end there is a dictionary to map them back to values. So let's ignore the difference for this discussion, integers are simpler to think about.

if you have 5 million documents, with a multi-valued integer field, with an avg of 10 values per document, it's really like an int[50_000_000], plus another data structure on the side to find the start/end per doc (as some docs might have 6 values, another 10, another only 2, and so on).

so the multi-valued types really must be used carefully, otherwise in that case it's 10x slower (docs * values). in a lot of cases, they are still ok, for two reasons:

  • the user is adding each of these values and aware that they are doing this.
  • the values within a doc are sorted, so accessing min/max/median is O(1), which means operations like sorting still only have to deal with 5M values, not 50M.
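
A minimal sketch of the layout described above, in Python rather than the real Lucene code. The class and method names are hypothetical, and the sketch assumes every document has at least one value: one flat array holds all values, a side array of offsets marks where each doc starts and ends, and because values are sorted within a doc, min and max are single array reads.

```python
from typing import List


class MultiValuedColumn:
    """Toy column: one flat array of values plus per-doc start offsets.

    Hypothetical illustration only; not the Lucene on-disk format.
    """

    def __init__(self, docs: List[List[int]]) -> None:
        self.values: List[int] = []
        self.offsets: List[int] = [0]
        for doc_values in docs:
            # Values are stored sorted within each document.
            self.values.extend(sorted(doc_values))
            self.offsets.append(len(self.values))

    def doc_values(self, doc_id: int) -> List[int]:
        start, end = self.offsets[doc_id], self.offsets[doc_id + 1]
        return self.values[start:end]

    def doc_min(self, doc_id: int) -> int:
        # First slot of the doc's range: one lookup, no scan.
        return self.values[self.offsets[doc_id]]

    def doc_max(self, doc_id: int) -> int:
        # Last slot of the doc's range: one lookup, no scan.
        return self.values[self.offsets[doc_id + 1] - 1]


column = MultiValuedColumn([[7, 3, 9], [4], [8, 1]])
print(column.values)      # flat array across all docs
print(column.offsets)     # start/end boundaries per doc
print(column.doc_min(0), column.doc_max(2))
```

Sorting by min or max then touches one slot per document, while anything that needs every value still walks the whole flat array.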

on the other hand, if we were to just analyze the content automatically, and populate values from the analysis chain, it's really trappy: the user is unaware of how many values are being added per document. The day we allow this is the day some simpleton opens a bug complaining about how docvalues is horrible because they turned it on for an edge-ngram field. That's why the docvalues API makes you add each value explicitly rather than allowing IndexWriter to populate them from the analysis chain automatically.
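
To make the edge-ngram trap concrete, here is a rough Python sketch that counts how many distinct values a short text would contribute if every edge n-gram became a column value. The function names, the whitespace tokenization, and the `min_gram`/`max_gram` settings are illustrative assumptions, not the behavior or defaults of any real analyzer.

```python
from typing import List, Set


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def distinct_values(text: str) -> Set[str]:
    """Distinct edge n-grams produced by a naive whitespace tokenizer."""
    grams: Set[str] = set()
    for word in text.lower().split():
        grams.update(edge_ngrams(word))
    return grams


text = "elasticsearch doc values"
print(len(text.split()))           # words the user thinks they indexed
print(len(distinct_values(text)))  # values the column would actually hold
```

Three words turn into 22 distinct values here, and the user never wrote any of them explicitly.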

also, there is little value in doing this IMO. let's say we allowed it, ok now you have a column-wide field with a bunch of unique terms (as string values are sorted and deduplicated: they lose both original order and frequency), what will you do with that? you can't get the most common terms without having to go back to some other data structure like the term dictionary to recover the original term frequency, and now the whole thing blows up, not just 50 million things being processed, but probably more like 100s of millions of random accesses to boot (50 million ordinal -> term lookups + 50 million term dictionary lookups + 50 million seeks to the postings lists + 50 million advance() calls to get to the doc ....)

so to summarize: fielddata might let you do it, but that does not mean it's a good idea for docvalues. it's been a goal to keep docvalues from having trappy behavior.

I don't know what you are trying to do, but this does not sound to me like the right data structure.


(James Macdonald) #5

That makes sense, and I appreciate the commitment to avoiding trappy options.

What I am trying to do is evaluate whether it is possible/a good idea to do some work (on a forked repo) to allow tokenized string fields in doc values. The reason is that we have moved almost all of our fields from field data to doc values, but we are still memory bound in terms of our required cluster size due to 2 analyzed string fields. We would like to be able to save memory by moving that on to disk as well, but it sounds like the performance issues make this infeasible.

I am somewhat curious why it is possible to do a terms aggregation (for example) using field data and not with doc values. As far as I know, both are used in memory (doc values use the FS cache, right?), so unless it is a problem with paging lots of data, that means the data structures must be different.

If you don't mind me asking, is there a difference between the way fielddata handles fields with many tokens and what is available in doc values?

Thanks


#6

Well, what you described is confusing. Again, if you have a field with values of B,A,B,C, frequency and order are discarded. As far as fielddata/docvalues are concerned, it becomes A,B,C for the doc and you lose the fact that B occurred twice.

Maybe for your case, since documents are short, this does not matter to you, because most terms have a frequency of 1 anyway. But you see how it does not work in general for "find most frequent word"? So when you say it's "possible to do a terms aggregation" with fielddata, I'm not convinced it really solves your problem exactly; instead it gives you more of an IDF-type measure.

As far as differences between docvalues and fielddata, conceptually it's the same, but behind the scenes the data structures, compression, etc. are very different.
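
A small Python sketch of that distinction, using made-up documents (the B,A,B,C doc from above plus two invented ones). `column_view` is a hypothetical stand-in for the sort-and-deduplicate step; counting over it yields the number of documents containing each term, not the number of occurrences.

```python
from collections import Counter
from typing import List

# Toy corpus: each inner list is the token stream of one document.
docs: List[List[str]] = [
    ["B", "A", "B", "C"],
    ["B", "B", "B"],
    ["A", "C"],
]


def column_view(tokens: List[str]) -> List[str]:
    """What survives per doc in the column: sorted, deduplicated terms."""
    return sorted(set(tokens))


# Counting over the column view gives document counts per term.
doc_counts = Counter(term for doc in docs for term in column_view(doc))

# Counting over the raw token streams gives true occurrence counts.
occurrences = Counter(term for doc in docs for term in doc)

print(column_view(docs[0]))
print(dict(doc_counts))
print(dict(occurrences))
```

By occurrences B is clearly the most frequent word, but by document count all three terms tie, which is the IDF-type measure mentioned above.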


(James Macdonald) #7

I understand that in the way doc_values/fielddata are stored we lose the frequency of terms in a given document. My question then is, how does the terms aggregation recover this data?

Also, since doc_values are evidently not an appropriate format for analyzed strings, but fielddata works very well, would it be feasible to write the fielddata data structure to disk and retrieve it using the file system cache? The reason I ask is that I know that when fielddata is not eager loaded it is possible to have huge cold query latency spikes. This is not the case with doc_values (because the index is already uninverted and a single large data structure is read into the FS cache?). I know this may not be feasible, I suspect because the fielddata structure is much more complex than the doc_values one, but I am not sure.

I guess the real question is what is the roadblock/reason stopping people from maintaining fielddata on disk and loading it the same/similar way doc_values are managed?

Thanks!


#8

It doesn't.


(James Macdonald) #9

Ok, I just want to be clear. I currently have a large cluster that uses terms aggregation on an analyzed string field. Currently this uses around 90GB of memory per replica, and is by far the resource that is overloaded first.

I am evaluating whether it is possible to move that memory use to disk (and FS cache, similar to how doc_values are loaded). My first thought was to look into why doc_values cannot be used. I totally understand why they cannot: it is a poor data structure for this use case and has major performance problems.

My current thought is to evaluate taking the current field data structure and changing the caching behavior from a memory-backed Guava cache to a disk-backed cache loaded into the FS RAM cache when needed. Again, I know that the people at Elastic are very good, and that you have likely explored these options. I am basically looking to learn from research you have done on the subject before I start working on something that is doomed to failure.

Thanks for all your help.

