My Elasticsearch cluster contains indices with huge mappings, because some of my indices contain up to 60k different fields.
To elaborate a bit on my setup: each index contains data from a single source. Each source has several kinds of data (which I'll call layers), and each layer is indexed as a separate type in the index corresponding to its source. Each layer has its own attributes (about 20 on average). To avoid field name collisions, the attributes are indexed as "LayerId_FieldId".
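For reference, this is roughly how I count the fields per index, using the Python client (a minimal sketch; the host and the index name "source_x" are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Count the top-level fields declared in each type's mapping of one index.
mapping = es.indices.get_mapping(index="source_x")
for index_name, index_mapping in mapping.items():
    for type_name, type_mapping in index_mapping["mappings"].items():
        print(index_name, type_name, len(type_mapping["properties"]))

Summing these counts over the types of a single index is how I get to the ~60k figure.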
I'm trying to find a way to reduce the size of my mapping (as far as I understand, it might cause performance issues). One option is having one index per layer (and perhaps spreading large layers over several indices, each responsible for a different time segment). I have around 4000 different layers indexed right now, so let's say that with this approach I would end up with 5000 different indices. Is Elasticsearch fine with that? What should I be worried about (if at all) with such a large number of indices, some of them very small (some of the layers have as few as 100 items)?
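To make the first option concrete, this is roughly what I have in mind (a sketch with the Python client; the template name, index names, and shard settings are just assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# One template covering all per-layer indices (in ES 2.x the "template" key
# is the index name pattern). A single shard each would keep the total shard
# count manageable despite having thousands of small indices.
es.indices.put_template(name="layer_template", body={
    "template": "layer_*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    },
})

# Each layer gets its own index, e.g. "layer_x", or "layer_x-2016.08"
# if a large layer is split into time segments.
es.index(index="layer_x", doc_type="data", body={
    "name": "John Doe",
    "age": 34,
    "isAdult": True,
})

With this layout the field names would no longer need the "LayerId_" prefix, since each layer lives in its own index.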
A second possible solution is the following. Instead of saving a layer's data in the form it is sent to me, for example:
"LayerX_name" : "John Doe",
"LayerX_age" : 34,
"LayerX_isAdult" : true,
it would be saved as:
"value1_string" : "John Doe",
"value2_number" : 34,
"value3_boolean" : true,
With the latter option, I would have to keep a metadata index that links the generic names to the real field names. In the above example, I need to know that for layer X the field "value1_string" corresponds to "name". So whenever I receive a new document to index, I first have to query the metadata to know how to map its fields onto my generic names. This keeps the mapping at a constant size (say, 50 fields for each value type, so a few hundred fields overall). However, it introduces some overhead, and more importantly I feel it essentially reduces my database to a relational one, and I lose the ability to handle documents of arbitrary structure.
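The indexing path for this option would look roughly like this (a sketch; the index names, the "field_map" metadata document, and the helper function are all made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def index_with_generic_fields(layer_id, doc):
    # Hypothetical metadata index holding, per layer, a map from real field
    # names to generic ones, e.g.
    # {"name": "value1_string", "age": "value2_number", "isAdult": "value3_boolean"}.
    meta = es.get(index="layers_metadata", doc_type="layer", id=layer_id)
    field_map = meta["_source"]["field_map"]

    # Rename the incoming fields to their generic counterparts before indexing.
    generic_doc = {field_map[field]: value for field, value in doc.items()}
    es.index(index="data", doc_type="generic", body=generic_doc)

index_with_generic_fields("layer_x", {"name": "John Doe", "age": 34, "isAdult": True})

Every search would need the same translation in reverse, which is part of the overhead I mentioned.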
Some technical details about my cluster:
Elasticsearch version 2.3.5
22 nodes, 3 of them master nodes; each node has 16 GB of RAM and 2 TB of disk storage. In total I currently have 6 TB of data spread over 1.2 billion docs, 55 indices, and 1500 shards.
I'd appreciate your input on the two solutions I suggested, or any other alternatives you have in mind!