Elasticsearch document field and value statistics?

I am working on a Python program that gives me statistics across all documents in a given Elasticsearch index. Docs in our indexes have many fields that are not explicitly mapped in the ES settings; they are just treated as opaque text strings (and we don't search or aggregate on them).

Here's what the script does in a nutshell:

  • It downloads every document in the given index (a match_all query with no filters)
  • It keeps a list of all fields that appear at least once in any document, in dot notation: FirstLevel.SecondLevelArray[].ThirdLevel
  • For each field, it counts how many docs contain that field
  • For each field, it lists the top 100 values that the field takes across all docs (including how often each value appears in that field across all docs)

Sample output:
FirstLevel.SecondLevelArray[].ThirdLevel appears 9423 times. Top values: 1010 times Value1, 900 times Value2, 423 times Value3

This is a pretty handy tool for us to audit the documents we have in our index.
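
For context, here is a stripped-down sketch of how the script currently works. The host, index name, and top-N constant are placeholders, and it assumes the official elasticsearch-py client; the real script differs in detail, but the approach is the same:

```python
from collections import Counter, defaultdict
from elasticsearch import Elasticsearch, helpers

ES_HOST = "http://localhost:9200"  # placeholder host
INDEX = "my-index"                 # placeholder index name
TOP_N = 100

def flatten(obj, prefix=""):
    """Yield (field_path, value) pairs in dot notation, with [] marking arrays."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(value, path)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, f"{prefix}[]")
    else:
        yield prefix, obj

def main():
    es = Elasticsearch(ES_HOST)
    doc_counts = Counter()                # how many docs contain each field
    value_counts = defaultdict(Counter)   # per-field value frequencies

    # match_all over the whole index; helpers.scan pages through via the scroll API
    for hit in helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}}):
        pairs = list(flatten(hit["_source"]))
        for field in {f for f, _ in pairs}:
            doc_counts[field] += 1
        for field, value in pairs:
            value_counts[field][str(value)] += 1

    for field, n_docs in doc_counts.most_common():
        top = ", ".join(f"{c} times {v}" for v, c in value_counts[field].most_common(TOP_N))
        print(f"{field} appears {n_docs} times. Top values: {top}")

if __name__ == "__main__":
    main()
```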

Naturally, getting all docs out of an index and then doing those counts can be very time-consuming. So I was wondering:

  • Is there a built-in feature in ES that would give me such stats without having to do this myself?
  • If not, instead of doing a match_all to download all docs, is there a way to get a random sample of N documents, e.g. only return every 10th or 100th document? I read through the random_sampler aggregation, but as far as I can tell it only feeds the sampled docs into child aggregations on specific fields (see the sketch below), which is not what I want. I want a random sample of entire documents.
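
To illustrate the random_sampler point: my understanding is that it has to wrap child aggregations on concrete, mapped fields, and the sampled documents themselves are never returned. A rough sketch of what I mean (index name, field name, and probability are placeholders; assumes an 8.x elasticsearch-py client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# random_sampler only acts as a wrapper around child aggregations, so you
# still have to name specific, mapped fields to aggregate on; the sampled
# documents themselves do not come back in the response.
resp = es.search(
    index="my-index",  # placeholder index name
    size=0,
    aggs={
        "sample": {
            "random_sampler": {"probability": 0.1},
            "aggs": {
                "top_values": {"terms": {"field": "some.mapped.field"}},
            },
        }
    },
)
print(resp["aggregations"]["sample"])
```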