Documents size vs. indexing size?

We have an existing Kibana/Elastic setup ingesting logs from our infrastructure.

I'm trying to determine the actual size of the documents, i.e. excluding the indexing overhead. Basically, what is the volume of logs coming out of our infrastructure?

It's unclear whether I can get that value from Kibana. I'm looking at the JSON under Stack Management > Index Management > (index) > Stats, but I can't tell whether the values I want are even available there. I'm assuming stats > store > size_in_bytes is the total value (documents + indexes). Is there any way to get just the size of the documents themselves?

Or can I get this from file sizes on the Elastic node itself somehow?

Hi @Denis_Haskin, welcome to the community.

Great questions... your terminology is a tiny bit off, but you are basically on track.

When a JSON document is ingested into Elasticsearch, it is indexed (verb) into an index (noun).
At index time (the verb) a number of things can and do happen.

In short, the original document is stored in the _source field, and the fields that you index to make available for search, aggregations, etc. are processed and stored in various data structures (inverted index, doc values, and so on). To add a little extra confusion, you can also choose not to index a field (fields are indexed by default, but you can turn this off), which makes that field unavailable for search or aggregations, although it can still be returned in the results.

When you search in Elasticsearch, it searches against the indexed fields, not the raw _source field, which is stored but not indexed :) The _source is then available in the results.
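To make the indexed vs. stored distinction concrete, here is a minimal sketch of a mapping body built as a Python dict. The field names (message, trace_id) are hypothetical, not from your setup. The trace_id field has both "index" and "doc_values" turned off, so it would not be searchable or aggregatable, but it would still come back in hits as part of _source.

```python
import json

# Hypothetical mapping body; field names are made up for illustration.
mapping = {
    "mappings": {
        "properties": {
            # Indexed as usual: searchable full text.
            "message": {"type": "text"},
            # Not indexed and no doc values: not searchable or aggregatable,
            # but the value is still kept in _source and returned in results.
            "trace_id": {"type": "keyword", "index": False, "doc_values": False},
        }
    }
}

# This JSON could be sent as the body of a create-index request.
mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```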

Clear as mud, right? :)

All that together is contained in the index (noun) and contributes to the overall index size.

Sometimes there is a tendency to think like an RDBMS, where the index refers to the specific structure that allows searching. In Elasticsearch, the index (noun) is the whole thing: the stored _source plus all the search structures.

You can do a number of things to drop fields from the _source, or drop the _source entirely (careful!). The Elasticsearch docs have other disk tuning suggestions... there are always tradeoffs.

You can run

POST /my-index-000001/_disk_usage?run_expensive_tasks=true

To get great detail about the storage consumed by your index.

Look for the _source section to determine the on-disk size of the stored original documents:

....
     },
      "_source" : {
        "total" : "2.9gb",
        "total_in_bytes" : 3174061182,
        "inverted_index" : {
          "total" : "0b",
          "total_in_bytes" : 0
        },
        "stored_fields" : "2.9gb",
        "stored_fields_in_bytes" : 3174061182,
        "doc_values" : "0b",
        "doc_values_in_bytes" : 0,
        "points" : "0b",
        "points_in_bytes" : 0,
        "norms" : "0b",
        "norms_in_bytes" : 0,
        "term_vectors" : "0b",
        "term_vectors_in_bytes" : 0
      },
...
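If you want to pull that number out programmatically, here is a small sketch. It assumes the response is keyed by index name with a "fields" object underneath, which is how I understand the disk usage API to be laid out; verify against your own output. The sample dict is trimmed to the values shown above, and my-index-000001 is just the placeholder name from the request.

```python
def source_bytes(disk_usage_response: dict, index_name: str) -> int:
    """Return the on-disk size of the _source field for one index.

    Assumes the layout response[index_name]["fields"]["_source"].
    """
    fields = disk_usage_response[index_name]["fields"]
    return fields["_source"]["total_in_bytes"]


# Trimmed sample using the values from the output above.
response = {
    "my-index-000001": {
        "fields": {
            "_source": {
                "total_in_bytes": 3174061182,
                "stored_fields_in_bytes": 3174061182,
            }
        }
    }
}

size = source_bytes(response, "my-index-000001")
print(f"_source: {size} bytes ({size / 1024**3:.2f} GiB)")
```

Keep in mind that stored fields are compressed on disk, so this is the stored size of the original documents, not necessarily the raw size of the logs as they were shipped.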

If you want to know the avg bytes per stored document on disk run

# Get the index stats
GET _cat/indices/filebeat-7.15.2-2022.06.02-000185/?v&s=pri.store.size:desc&bytes=b

# Results
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   filebeat-7.15.2-2022.06.02-000185 0UJn0UA7Qcutg-rc8UPhlQ   1   1    5254848            0 4986229319     2516232571

For Primary
Avg Bytes / Doc = pri.store.size / docs.count = 2516232571 / 5254848 = ~479 bytes / doc
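If you have many indices, the same arithmetic can be scripted. This is a rough sketch that parses the text output of _cat/indices (run with ?v and bytes=b, as above) and divides pri.store.size by docs.count for each row. The function name is my own, and it assumes the columns are whitespace-separated with a header row.

```python
def avg_bytes_per_doc(cat_indices_output: str) -> dict:
    """Map index name -> average primary bytes per document.

    Expects _cat/indices text output with a header row (?v) and
    sizes in plain bytes (bytes=b).
    """
    lines = [ln for ln in cat_indices_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    result = {}
    for row in lines[1:]:
        cols = dict(zip(header, row.split()))
        docs = int(cols["docs.count"])
        pri_bytes = int(cols["pri.store.size"])
        # Guard against empty indices to avoid dividing by zero.
        result[cols["index"]] = pri_bytes / docs if docs else 0.0
    return result


sample = """
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   filebeat-7.15.2-2022.06.02-000185 0UJn0UA7Qcutg-rc8UPhlQ   1   1    5254848            0 4986229319     2516232571
"""

averages = avg_bytes_per_doc(sample)
for name, avg in averages.items():
    print(f"{name}: ~{avg:.0f} bytes / doc")
```

Note this is the average over everything in the primary shards (the _source plus all the index structures), not just the original documents.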

After you are done writing to an index, you can force merge it (for example, with an ILM policy) to optimize it a bit more.

Hope this helps...
