Hi Elastic fanatics,
Over the last couple of years I was able to find everything I wanted to know in either the official documentation or this forum. For the first time I have stumbled upon something I'm not sure about and can't find an answer to anywhere. I really hope someone can help me out here. Let's go:
Use case
We want some really large fields (99% of a 2000 KB message) not to be indexed. How does this affect the storage and memory requirements of Elasticsearch? We still keep all the data in _source; we only limit how it can be retrieved, by searching on a few keywords instead of indexing everything.
With our current setup, where all incoming data is indexed, we know exactly how much data and how many shards a node can handle while the cluster remains happy. We're not really sure what will happen after this change.
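To make the intended change concrete, here is a minimal sketch of the kind of mapping we have in mind, written as a Python dict that would be sent as the index-creation body. The field names `keywords` and `payload` are hypothetical placeholders for our real fields; the only point is that the large field is mapped with `"index": false` while `_source` stays enabled (the default).

```python
import json


def build_mapping():
    """Return an index-creation body in which the large field is not indexed.

    Field names are hypothetical: "keywords" stands for the small searchable
    part, "payload" for the large part (about 99% of the message).
    """
    return {
        "mappings": {
            "properties": {
                # Small field that stays searchable.
                "keywords": {"type": "keyword"},
                # Large field: still returned as part of _source,
                # but no inverted index is built for it.
                "payload": {"type": "text", "index": False},
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be used when creating the index.
    print(json.dumps(build_mapping(), indent=2))
```

With a mapping like this, a search on `payload` is not possible, but the field still comes back in `_source` for documents found through `keywords`.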
Assumption 1
When we don't index 99% of the characters in our messages but we still store the _source, our shards can handle more messages and become larger before they become unstable: we can store more data per node.
Assumption 2
The inverted inverted index in RAM affects the happiness of a node and is based on the inverted index on disk. If the inverted inverted index (RAM) gets too big, a node becomes unstable.
Question 1
Are my assumptions correct?
If so...
Question 2
How do I find/calculate the inverted index size? Is that the combined size of the .doc and .pos files on disk?
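This is roughly how I would measure it: a small sketch that sums file sizes per extension in a shard's Lucene index directory. The directory path is a placeholder, and the list of extensions is my assumption: besides .doc and .pos I have included .tim (term dictionary), .tip (term index) and .pay (payloads/offsets), which I believe also belong to the postings. Small segments may be packed into compound .cfs files, in which case this breakdown would miss them.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: these Lucene file extensions together make up the inverted index.
POSTINGS_EXTENSIONS = (".tim", ".tip", ".doc", ".pos", ".pay")


def size_by_extension(shard_index_dir):
    """Sum file sizes in bytes per file extension for one directory."""
    totals = defaultdict(int)
    for path in Path(shard_index_dir).iterdir():
        if path.is_file():
            totals[path.suffix] += path.stat().st_size
    return dict(totals)


def inverted_index_bytes(totals):
    """Add up the sizes of the extensions assumed to hold the inverted index."""
    return sum(totals.get(ext, 0) for ext in POSTINGS_EXTENSIONS)


if __name__ == "__main__":
    # Placeholder path: point this at the "index" directory of one shard.
    shard_dir = Path("path/to/shard/index")
    if shard_dir.is_dir():
        totals = size_by_extension(shard_dir)
        for ext, size in sorted(totals.items()):
            print(f"{ext or '(none)'}: {size} bytes")
        print("inverted index (assumed):", inverted_index_bytes(totals), "bytes")
```

Running this against a shard of the fully indexed index and a shard of the mostly unindexed one would give the comparison I'm after, if the extension list is right.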
Question 3
How do I find/calculate the size of the inverted inverted index that Elasticsearch stores in memory? I would like to compare this between an index where all data is indexed and an index where most of the data isn't indexed.
If not...
Question 4
If assumption 1 is not right, i.e. you can't store more messages/data in a shard when 99% of the characters are not indexed, can we then get away with less RAM?
If my assumptions are totally wrong and my questions don't make any sense, please advise on what the right direction would be.