Elasticsearch document field and value statistics?

I am working on a Python program that gives me statistics across all documents in a given Elasticsearch index. Docs in our indexes have many fields that are not explicitly mapped in the ES settings; they are just treated as opaque text strings (and we don't search or aggregate on them).

Here's what the script does in a nutshell:

  • It downloads every document in the given index (a match_all query with no filters)
  • It keeps a list of all fields that appear at least once in any document, in dot notation: FirstLevel.SecondLevelArray[].ThirdLevel
  • For each field, it counts how many docs contain that field
  • For each field, it lists the top 100 values that the field takes across all docs (including how often each value appears in that field across all docs)

Sample output:
FirstLevel.SecondLevelArray[].ThirdLevel appears 9423 times. Top values: 1010 times Value1, 900 times Value2, 423 times Value3

This is a pretty handy tool for us to audit the documents we have in our index.
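
For context, here is a stripped-down sketch of how the script currently works. The host, index name, and top-N constant are placeholders, and it assumes the official elasticsearch-py client; the real script differs in detail, but the approach is the same:

```python
from collections import Counter, defaultdict
from elasticsearch import Elasticsearch, helpers

ES_HOST = "http://localhost:9200"  # placeholder host
INDEX = "my-index"                 # placeholder index name
TOP_N = 100

def flatten(obj, prefix=""):
    """Yield (field_path, value) pairs in dot notation, with [] marking arrays."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(value, path)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, f"{prefix}[]")
    else:
        yield prefix, obj

def main():
    es = Elasticsearch(ES_HOST)
    doc_counts = Counter()                # how many docs contain each field
    value_counts = defaultdict(Counter)   # per-field value frequencies

    # match_all over the whole index; helpers.scan pages through via the scroll API
    for hit in helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}}):
        pairs = list(flatten(hit["_source"]))
        for field in {f for f, _ in pairs}:
            doc_counts[field] += 1
        for field, value in pairs:
            value_counts[field][str(value)] += 1

    for field, n_docs in doc_counts.most_common():
        top = ", ".join(f"{c} times {v}" for v, c in value_counts[field].most_common(TOP_N))
        print(f"{field} appears {n_docs} times. Top values: {top}")

if __name__ == "__main__":
    main()
```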

Naturally, getting all docs out of an index and then doing those counts can be very time-consuming. So I was wondering:

  • Is there a built-in feature in ES that would give me such stats without having to do this myself?
  • If not, instead of doing a match_all to download all docs, is there a way to get a random sample of N documents, e.g. only return every 10th or 100th document? I read through the random_sampler aggregation, but as far as I can tell it only feeds the sampled docs into child aggregations on specific fields (see the sketch below), which is not what I want. I want a random sample of entire documents.
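
To illustrate the random_sampler point: my understanding is that it has to wrap child aggregations on concrete, mapped fields, and the sampled documents themselves are never returned. A rough sketch of what I mean (index name, field name, and probability are placeholders; assumes an 8.x elasticsearch-py client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# random_sampler only acts as a wrapper around child aggregations, so you
# still have to name specific, mapped fields to aggregate on; the sampled
# documents themselves do not come back in the response.
resp = es.search(
    index="my-index",  # placeholder index name
    size=0,
    aggs={
        "sample": {
            "random_sampler": {"probability": 0.1},
            "aggs": {
                "top_values": {"terms": {"field": "some.mapped.field"}},
            },
        }
    },
)
print(resp["aggregations"]["sample"])
```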