Improve mapping performance on Elasticsearch

(yarden) #1

My Elasticsearch cluster contains indices with giant mappings, because some of my indices contain up to 60k different fields.

To elaborate a bit about my setup: each index contains information from a single source. Each source has several types of data (what I'll call layers), which are indexed as different types in the index corresponding to the source. Each layer has different attributes (20 on average). To avoid field name collisions, the attributes are indexed as "LayerId_FieldId".

I'm trying to find a way to reduce the size of my mapping (as, to my understanding, it might cause performance issues). One option is having one index per layer (and perhaps spreading large layers over several indices, each responsible for a different time segment). I have around 4000 different layers indexed right now, so let's say that with this method I will have 5000 different indices. Is Elasticsearch fine with that? What should I be worried about (if at all) with such a large number of indices, some of them very small (some of the layers have as few as 100 items)?
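To make the first option concrete, here is a sketch of how documents could be routed to per-layer indices (the naming scheme and the monthly segmenting for large layers are just illustrative assumptions, not part of my actual setup):

```python
# Illustrative sketch: pick a target index per document, one index per
# layer, with large layers split into monthly time segments.
from datetime import datetime

def index_name(layer_id, timestamp=None, large_layer=False):
    """Return the index a document should be routed to.

    layer_id, large_layer, and the monthly segmenting scheme are
    hypothetical; Elasticsearch index names must be lowercase, hence
    the final lower().
    """
    name = "layer_%s" % layer_id
    if large_layer and timestamp is not None:
        # e.g. one index per month for high-volume layers
        name += timestamp.strftime("-%Y.%m")
    return name.lower()
```

With this scheme, small layers each get one index, and only the large ones multiply into time-segmented indices.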

A second possible solution is the following. Instead of saving a layer's data in the way it is sent to me, for example:

"LayerX_name" : "John Doe",
"LayerX_age" : 34,
"LayerX_isAdult" : true,

it will be saved as:

"value1_string" : "John Doe",
"value2_number" : 34,
"value3_boolean" : true,

In the latter option, I will have to keep a metadata index which links the generic names to the real field names. In the above example, I need to know that for layer X the field "value1_string" corresponds to "name". Thus, whenever I receive a new document to index, I have to query the metadata in order to know how to map its fields onto my generic names. This keeps the mapping at a constant size (say, 50 fields for each value type, so several hundred fields overall). However, it introduces some overhead, and most importantly I feel that this basically reduces my database to a relational one, and I lose the ability to handle documents of arbitrary structure.
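To make the second option concrete, here is a sketch of the translation step. The metadata store is faked as an in-memory dict here; in practice it would be the separate metadata index, queried (and presumably cached) per layer:

```python
# Illustrative sketch: translate between a layer's real field names and
# the shared generic slots, using per-layer metadata. The metadata is a
# plain dict here; in a real setup it would come from a metadata index.
LAYER_METADATA = {
    "LayerX": {"name": "value1_string",
               "age": "value2_number",
               "isAdult": "value3_boolean"},
}

def to_generic(layer_id, doc):
    """Rename a document's fields to generic slots before indexing."""
    mapping = LAYER_METADATA[layer_id]
    return {mapping[field]: value for field, value in doc.items()}

def from_generic(layer_id, doc):
    """Restore real field names when reading a document back."""
    reverse = {v: k for k, v in LAYER_METADATA[layer_id].items()}
    return {reverse[field]: value for field, value in doc.items()}
```

The same translation has to be applied to every query, which is where most of the overhead I mentioned would come from.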

Some technical details about my cluster:

Elasticsearch version 2.3.5

22 nodes, 3 of them masters; each node has 16 GB of RAM and 2 TB of disk storage. In total I currently have 6 TB of data spread over 1.2 billion docs, 55 indices, and 1500 shards.

I'd appreciate your input on the two solutions I suggested, or any other alternatives you have in mind!

(Mark Harwood) #2

An Elasticsearch cluster is scalable in many dimensions, but the number of unique field names is not one of them.

Whenever there's a SaaS company trying to squeeze many customers with different data into the same cluster, there's potential for issues with the number of unique field names. One company I met had tried to resolve the issue using tiny nested docs of the form:

"my_custom_properties": [
    { "fieldname": "age", intValue : 22},
    { "fieldname": "name", stringValue : "Fred"},    

The set of field names is bounded, but this hit scaling issues with the volume of nested docs and complicated their queries.
In the end they opted for the scheme you outlined - banks of reserved field names, e.g. string1, string2, int1, int2. An application-level mapping layer translated customer 1's "name" to string1, while customer 2's "department" was also string1. This field-sharing can have screwy effects on relevance ranking, but they lived with it.
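To illustrate why the nested scheme complicated their queries: a simple range filter on a customer's "age" field has to become a nested bool query that matches both the stored field name and its value. A sketch of such a query body as a Python dict, reusing the field names from the snippet above (illustrative, not their actual queries):

```python
# What "age > 21" turns into under the nested-docs scheme: the query
# must constrain both the fieldname and the value inside each nested doc.
nested_age_query = {
    "nested": {
        "path": "my_custom_properties",
        "query": {
            "bool": {
                "must": [
                    {"term": {"my_custom_properties.fieldname": "age"}},
                    {"range": {"my_custom_properties.intValue": {"gt": 21}}},
                ]
            }
        }
    }
}
```

Every extra field condition adds another nested clause like this, which is part of what made the approach painful at scale.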

Our own experience running Elastic Cloud for many customers led us to create administration tools for running a dedicated cluster per customer. This provides isolation, but you need tooling to administer that many clusters. That tooling is available to run on your own premises as Elastic Cloud Enterprise (ECE), which may be of interest.

(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.