Hello folks, I'm currently trying to improve the performance of an Elasticsearch cluster, which isn't easy because I'm no expert. The cluster is made of 6 data nodes and 3 master nodes; each data node has 64GB of memory (~31GB of which is dedicated to the Java heap), 12 CPUs, and 32TB of storage. (Note that I'm not in control of the infrastructure.) It currently holds 710 indices, split into 4,210 shards, containing 140,878,512,079 docs, using up 129.36TB.
One thing that stands out when looking at the cluster's performance is the heap usage: it's constantly above 90% on all 6 data nodes. This is what `GET /_cluster/stats?human&pretty` returns; more specifically, the `segments` part:
"segments" : {
"count" : 23252,
"memory" : "145.5gb",
"memory_in_bytes" : 156258619814,
"terms_memory" : "100.9gb",
"terms_memory_in_bytes" : 108396686231,
"stored_fields_memory" : "42.3gb",
"stored_fields_memory_in_bytes" : 45459085936,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "4.5mb",
"norms_memory_in_bytes" : 4786304,
"points_memory" : "1.4gb",
"points_memory_in_bytes" : 1533316923,
"doc_values_memory" : "824.6mb",
"doc_values_memory_in_bytes" : 864744420,
"index_writer_memory" : "15.1mb",
"index_writer_memory_in_bytes" : 15878264,
"version_map_memory" : "13.1kb",
"version_map_memory_in_bytes" : 13490,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : 1571298991618,
"file_sizes" : { }
}
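For reference, this is how I checked that the heap usage is above 90% on every data node, and that segment memory is in the same ballpark everywhere. I took the column names from the `_cat/nodes` docs, so I hope I have them right:

```
GET _cat/nodes?v&h=name,node.role,heap.percent,segments.count,segments.memory
```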
The `terms_memory` line is the one that jumps out first. From what I understand, terms are the values of fields for which an inverted index is built, and that index is then stored in memory.
So in my head, it looks something like:

```
term1 -> doc1
term2 -> doc1, doc2, doc4
term3 -> doc3
...
```
Is it reasonably accurate so far?
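To make the question concrete: I assume the terms generated for a single document can be inspected with the `_termvectors` API (the index name, doc id, and field name below are made up, and I'm assuming 7.x-style syntax):

```
GET my-index/_termvectors/1?fields=my_text_field
```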
If it is, I have more questions:
- Is this index replicated in the memory of each node?
- Is this index per index, or global? That is, if the same term exists in two different indices, which of these situations happens?

  ```
  situation 1 (global):    term1 -> doc1, doc12
  situation 2 (per index): term1 (index1) -> doc1
                           term1 (index2) -> doc12
  ```

- Is it possible to "unindex" existing fields, or just delete them? (There's a sketch of what I mean after this list.)
- How does the number of documents a term appears in affect the memory usage? I.e., if I have a term that appears in 1 document and another that appears in 100 documents, does the latter have a 100x larger memory footprint, or a similar one? I understand that I won't get a numerical answer to this, but I'm still interested in a ballpark.
- Is there any other information you think might help me?
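To clarify the "unindex" question above: what I have in mind is recreating the index with the heavy field mapped as not indexed and reindexing into it, roughly like the sketch below. The index and field names are made up, the mapping syntax assumes 7.x, and I haven't tried this yet:

```
PUT new-index
{
  "mappings": {
    "properties": {
      "some_heavy_field": { "type": "keyword", "index": false }
    }
  }
}

POST _reindex
{
  "source": { "index": "old-index" },
  "dest": { "index": "new-index" }
}
```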
I'm reasonably confident that a solution to my problem is to reduce the number of terms, but I'd like to better understand how they work first. Thanks in advance.