Hi @Denis_Haskin, welcome to the community.
Great questions... a bit of your terminology is a tiny bit off, but you are basically on track.
When a JSON document is ingested into Elasticsearch, it is indexed (verb) into an index (noun).
At index time (the verb) a number of things can / do happen.
In short, the original document is stored in the _source field, and the fields that you index to make available for search, aggregations, etc. are processed and stored in various data structures (inverted index, stored fields, doc values, and so on).
To add a little extra confusion, you can also choose not to index a field (fields are indexed by default, but you can turn this off in the mapping with index: false). That makes the field unavailable for search (and, if you also disable doc values, for aggregations), but it can still be returned in the results because it is part of _source.
When you search in Elasticsearch, it searches against the indexed fields, not the raw _source field. The _source is stored but not indexed, and it is what gets returned in the results.
Clear as mud, right?
All of that together is contained in the index (noun) and contributes to the overall index size.
Sometimes there is a tendency to think like an RDBMS, where "index" refers only to the specific structure that allows searching.
You can do a number of things to drop fields from the _source, or drop the _source entirely (careful). You can look here for other disk tuning suggestions... there are always tradeoffs.
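As a rough illustration of the two knobs mentioned above, here is a minimal sketch that builds an index-creation body in Python. The field names (debug.*, trace_id) are made up for the example, and the body would be sent with something like PUT /my-index; treat it as a starting point under those assumptions, not a recommended mapping.

```python
import json


def build_mapping(excluded_from_source, unindexed_fields):
    """Build a create-index body that trims _source and skips indexing for some fields."""
    properties = {
        # "index": False skips the inverted index for this field, so it cannot
        # be searched, but its value still lives in _source.
        name: {"type": "keyword", "index": False}
        for name in unindexed_fields
    }
    return {
        "mappings": {
            # Fields matching these patterns are removed from the stored _source.
            "_source": {"excludes": list(excluded_from_source)},
            "properties": properties,
        }
    }


# Hypothetical field names, for illustration only.
body = build_mapping(["debug.*"], ["trace_id"])
print(json.dumps(body, indent=2))
```

Keep in mind that anything excluded from _source can no longer be returned from it, so this is one of those tradeoffs.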
You can run
POST /my-index-000001/_disk_usage?run_expensive_tasks=true
This gives you great detail about the storage consumed by your index.
Look for the _source field to determine its size...
....
    },
    "_source" : {
      "total" : "2.9gb",
      "total_in_bytes" : 3174061182,
      "inverted_index" : {
        "total" : "0b",
        "total_in_bytes" : 0
      },
      "stored_fields" : "2.9gb",
      "stored_fields_in_bytes" : 3174061182,
      "doc_values" : "0b",
      "doc_values_in_bytes" : 0,
      "points" : "0b",
      "points_in_bytes" : 0,
      "norms" : "0b",
      "norms_in_bytes" : 0,
      "term_vectors" : "0b",
      "term_vectors_in_bytes" : 0
    },
...
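If you save that response, a small script can rank the fields by size so the big ones stand out. This is only a sketch: it assumes the per-index part of the response has a "fields" object whose entries each carry total_in_bytes, as in the fragment above. The _source number is the one from the output above; the "message" entry is made up for illustration.

```python
def rank_fields_by_size(index_usage):
    """Return (field_name, total_in_bytes) pairs, largest first."""
    fields = index_usage.get("fields", {})
    sizes = [(name, stats["total_in_bytes"]) for name, stats in fields.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Assumed shape of the per-index section of the _disk_usage response.
sample = {
    "fields": {
        "message": {"total_in_bytes": 1000000},      # made-up value
        "_source": {"total_in_bytes": 3174061182},   # value from the output above
    }
}

for name, size in rank_fields_by_size(sample):
    print(f"{name}: {size} bytes")
```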
If you want to know the average bytes per stored document on disk, run:
# Get the index stats
GET _cat/indices/filebeat-7.15.2-2022.06.02-000185/?v&s=pri.store.size:desc&bytes=b
# Results
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   filebeat-7.15.2-2022.06.02-000185 0UJn0UA7Qcutg-rc8UPhlQ   1   1    5254848            0 4986229319     2516232571
For the primary shards:
Avg Bytes / Doc = pri.store.size / docs.count = 2516232571 / 5254848 = ~479 bytes / doc
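The same arithmetic can be scripted if you want to check several indices. A minimal sketch, assuming you pass in the header and one row of the _cat/indices output exactly as shown above (with bytes=b so the sizes are plain numbers); the function name is just for this example.

```python
def avg_bytes_per_doc(header_line, row_line):
    """Compute pri.store.size / docs.count from one _cat/indices row."""
    # Pair each column name with the value in the same position.
    row = dict(zip(header_line.split(), row_line.split()))
    return int(row["pri.store.size"]) / int(row["docs.count"])


header = "health status index uuid pri rep docs.count docs.deleted store.size pri.store.size"
row = "green open filebeat-7.15.2-2022.06.02-000185 0UJn0UA7Qcutg-rc8UPhlQ 1 1 5254848 0 4986229319 2516232571"

avg = avg_bytes_per_doc(header, row)
print(f"{avg:.0f} bytes / doc")
```

An empty index (docs.count of 0) would raise a ZeroDivisionError here, so guard for that if you loop over many indices.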
After you are done writing to the index, you can force merge it (for example with an ILM policy) to optimize it a bit more.
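For reference, an ILM policy with a force merge step could look roughly like the body built below. This is a sketch, not a tuned policy: the warm phase, the min_age of "1d", and a single segment are assumptions for the example, and the body would be sent with PUT _ilm/policy/<your-policy-name>.

```python
import json


def forcemerge_policy(max_num_segments=1, min_age="1d"):
    """Build an ILM policy body that force merges indices in the warm phase."""
    return {
        "policy": {
            "phases": {
                "warm": {
                    # Assumed age at which the index enters the warm phase.
                    "min_age": min_age,
                    "actions": {
                        "forcemerge": {"max_num_segments": max_num_segments}
                    },
                }
            }
        }
    }


policy = forcemerge_policy()
print(json.dumps(policy, indent=2))
```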
Hope this helps...