Building List Box of Unique Values

I have a Centos7 5 node cluster. I have a proof of concept advanced search gui (.Net/Nest) with a critical requirement of filling in a list box with the existing values (at least a 1000 values) of one of the 20 or 30 fields in the index.

I used the terms aggregation and this worked great when we had up to 25 million documents, with the added benefit of listing the counts also. The performance, however, is slowing down significantly as I've grown the index to 250 million (and have another index growing to 2.5 Billion now).

It seems, an obvious low hanging fruit, to me, that an inverted index architecture should be able to very quickly return the unique values for a field, but in all the searches I've done on the web, they only point to the terms aggregation which counts also, and is not all that fast.

I even found a response that showed "select distinct(color) from someindex" which is NOT the same as a terms aggregation (that sql returns only the list of unique values, not the values and their counts which would be select color, count(*) from someindex.

Thanks, Joe R.

Aggregations process all matching docs sequentially on each shard so in some respects the response times are a function of the number of docs. The operations performed for each doc are highly optimized but for very large numbers of docs this may add up which may be what you are experiencing.
Obviously adding more nodes helps parallelise computation but it might be worth experimenting with reducing the volumes of docs fed into the aggregation tree. This can be done by using a query with a must_not filter [1] consisting of the "known knowns" ie some of the more common terms that account for a lot of the docs in your system. In the case of a logging system you may choose for example to exclude 200s and 404s as expected and well-understood codes and run an aggregation to see what other codes turn up in the minority of docs that aren't represented by these usual suspects.

Haven't tried it but probably worth a shot.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-filter.html

Thank you for your quick response. "Must_not" might help in a few situations but not normally. The data is unpredictable and comes from widely varying sources.

Re: "Aggregations process all matching docs sequentially". We agree that's why it doesn't scale so great (FYI, as good as or better than our current solution at 25 million scale, but lots of room for improvement).

Is it true that the inverted index(es) contains these unique values lists without having to aggregate yet there is no query dsl to access those unique value lists? Might I suggest a new query (? "list_values" ?) to the development team. Is this the place to suggest such an improvement? I just started my account.

We expose some index-level details [1][2] but I'm not sure either of these provide what you are after.
Specialized endpoints would be the place to add such information as the Query DSL is designed for producing streams of documents for top N hits or aggregations, not producing field summaries.

[1]https://www.elastic.co/guide/en/elasticsearch/reference/master/search-field-stats.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#_term_statistics

Thanks again. Re [1] and [2] neither produce a uniques list.

Re: My requirement for a list of unique values (without counts in the millions, even hundreds of millions and faster),

Just to make sure I am not crazy, the following link and cut and paste below show an inverted index:

https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html?q=inverted%20index

Term Doc_1 Doc_2 .... Doc_N

brown | X | X
dog | X | X
fox | X | X
in | | X
jump | X | X
lazy | X | X
over | X | X
quick | X | X
summer| | X

Is the inverted index shown anything like the physical implementation? Where can I look at that?

I just want "brown, dog, fox, in, jump, lazy, over, quick, summer". (in ES json format)
Terms aggregration is easy to use and returns quickly at a decent scale (25 million on my 5 million cluster) as it returns
brown 2, dog 2, fox 2, in 1, jump 2, lazy 2, over 2, quick 2, summer 1".

but degradation is significant (7-8 seconds) when the counts are over 250 million
brown 2013987, dog 2001234, fox 2842, in 248000000, jump 2091472, lazy 2, over 2202747, quick 224944, summer 1

I cannot grow the cluster (physical space and available servers).

Where do I request something that returns like a Terms Aggregation but that only shows the list, and leverages the inveted index to fill in a list box. (I would want the terms in alpha or numeric order).

Joe R.

The problem with providing a general API that lists all field values is that we would have to deal with scenarios unlike yours where there are many values. This would require:

  1. paginated access to a sorted stream of potentially millions of values
  2. fusion of results from multiple shards (while dealing with issues of ongoing updates)
  3. removal of values that are in the index but all documents are deleted

These are non-trivial problems.

Given your particular use case is to populate a listbox we must assume the number of unique values is small and therefore your application can sensibly cache this set of values rather than re-building it for each page view, no?

Please Consider that I am only looking for the first 1000 to fill in the list box.

Then the user would type and after a delay fill in the list box starting with those characters

OK - still sounds like it would make sense for your app to cache the first 1,000.

Then the user would type and after a delay fill in the list box starting with those characters

If you've not checked it out see the completion suggester API which caters for this use case.

OK, I can see that the Apache Lucene site gets this same question and answers are very similar. It is a choice of balancing different requirements, and that I should use caching when possible to help solve this. I see some reference in the definitive guide but will wait for 2.0 to see if eager global ordinals are still recommended.

Thank you for your support.

So one more possibility, Reading up on entity centric indexes, and yes, I noted the familiar author/presenter name.
eg. https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes

Could I use entity centric indexes to build list of unique values ? It seems like entity centric can do a lot more and would be overkill for this but it would handle it.

Thanks again, Joe R.

Something worth checking out is the es-token-plugin which will return all the tokens in the index. Please note this is unofficial and unsupported, and you may have to adapt it for your specific use case and version of ES.

Interesting option. I've been working some other issues but would like to get back to this one.