Design of data structure: one big index vs many smaller indexes

Hi there,
we are facing a design problem at the place where I work. Let's say that we have many data structures that contain documents. These data structures can take different forms, with fields that are defined by the user of the data structure. The documents inserted into a data structure are then indexed and searched.
example:

data structure A:
- name: string
- id: string
- date_of_birth: date

Data structure B:
- name: string
- id: integer
- age: integer

Now, data structures A and B can each contain 0-N documents.

Should we create an index per data structure? Or try to fit all data structures into one big index? Or an index per data type (i.e. one index for each of the ~10 data types)? Or something else?

So far we have used an index per data structure, but we end up with a huge number of indexes and shards on a single node, >1000 indexes (is there any limitation or recommendation on the number of indexes per node? Note that the queries never go cross-index). So far the queries are rather fast, but creating an index takes a long time (±10 sec).

Having one big index is problematic because every new structure adds fields, and having many fields in an index is not good (it's written somewhere in the docs, I can't recall the page).

Having an index per data type (one for integer, one for string, etc.) may work, but then any search that involves more than a single field becomes complicated (e.g. search over B where id > 1, age < 20 and name starts with "alb*").
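(For reference, within a single index that search is one bool query. A minimal sketch, assuming an index named b with id, age and name mapped as numeric and string fields:)

GET b/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "id": { "gt": 1 } } },
        { "range": { "age": { "lt": 20 } } },
        { "prefix": { "name": "alb" } }
      ]
    }
  }
}

Split across three per-type indexes, the same search would need three separate queries plus an application-side join on document identity.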

any suggestion or reference to understand that?

I'm a bit confused by this:

search over B where id>1 age<20 and name starts with "alb*"

How can this work? I mean, how can you find any document that matches if some of the documents only have the age field and the others only have the name field?

The point is that, over B, that query is easy if everything is in a single index; if I split it across 3 indexes it becomes complicated. But that's not the real crux of the whole question, it's just a small use case of one of the approaches.

If it's a single index with 2 different documents, I don't see how this can work.
If it's a single index with one single document containing both values, then I understand how it can match.
But then I don't understand how you can split the same document in different indices. Do you mean that you want to "join" indices? In which case, we don't support that.

Yes, it was about joining indices, but then that part is discarded as a solution.

Digging more into the documentation, it seems that we are facing a problem with the cluster state: we have more than a thousand indexes and the state must be kept in memory. The solution of having a single index with all the fields will not simplify anything at this point, not for the cluster state and certainly not for performance, since the queries will be more complicated. Am I right?

Any idea we can use? Is it possible to put indexes to "sleep" and "wake" them only when needed, so they will not use heap memory?

Is there a limit on the number of indexes or shards a node can hold?

Yes. Note that if you have a "big" cluster you can dedicate 3 nodes to be master-eligible only, so you won't suffer from heap pressure for both data and cluster state.
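(A sketch of the relevant elasticsearch.yml settings for dedicated masters, using the 2.x/5.x node role flags; the split of nodes is illustrative:)

# on the 3 dedicated master-eligible nodes
node.master: true
node.data: false

# on the data nodes
node.master: false
node.data: true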

the solution of having a single index with all the fields will not simplify anything at this point, not for the cluster state and certainly not for performance, since the queries will be more complicated. Am I right?

The cluster state will be lighter than with multiple indices.

How much data are you indexing in total? In gigabytes.

How many different fields do you have in total? What does a typical document look like?

You can close an index and then open it when needed, but that's not a fast operation. And it has some drawbacks when nodes leave or join back the cluster from another physical machine.
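(Close and open are plain index APIs; my_index below is a placeholder:)

POST /my_index/_close
POST /my_index/_open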

Is there a limit on the number of indexes or shards a node can hold?

Yes. That's why I'm asking for the total size of your dataset.

What is the output of:

GET /_cat/indices?v
GET /_cat/health?v
GET /_cat/nodes?v

The point is that the indexes are very light in terms of data, from a few KB to a few MB. For example, the biggest index has ~100K docs for a mere 10 MB of data; the rest average around 1000 docs (in production). So we don't reach GBs of data, that's the point.

In the test environment we have way more indexes (ideally we should have, but right now we just delete all of them and create them on demand due to problems), which have way fewer documents and less data. Let's say that test is 10x or 20x (at least) the size of production in terms of number of indexes, but with smaller size and fewer documents (the data are shown below).

We have a single node right now.

For the requested output:
the indexes are as explained above; here are just a few lines, but in prod I counted 1134 lines in the file

green open 9a355e 5 0    588   46 123.2kb 123.2kb
green open b60ae2 5 0     10    0  49.3kb  49.3kb
green open ff0e85 5 0      6    0  29.1kb  29.1kb
green open fdadc7 5 0      2    9  11.7kb  11.7kb
green open 8d7s1e 5 0      2    0   7.8kb   7.8kb
green open 1d6a3a 5 0     10   52  49.1kb  49.1kb

(in the test env we have 2 shards per index)
health

epoch      timestamp cluster status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1548780313 17:45:13  cluster   green        1         1   5358 5358    0    0        0             0                  -                100.0%

nodes

host      ip        heap.percent ram.percent load node.role master name
127.0.0.1 127.0.0.1           35          66 0.12 d         *      node-1

5358 shards is definitely too much.
First of all, one shard per index is enough. It will reduce the cluster state a lot.
What is the RAM size and the HEAP size?
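(For new indices this is just a setting at creation time; the number of shards cannot be changed in place on an existing index, so it only applies to indexes created or recreated from now on. The index name below is illustrative:)

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}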

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

RAM is 8 GB
heap is 3 GB

There's no way I can change the sharding once set, right?

The point is probably how we structure the indexing, in the sense that we allow our customers to create "folders" where they store "documents", and for each folder we have an index to search over the docs. But I can't figure out a solution that avoids a 1-to-1 mapping, since every doc can be of a different type.

There's no way I can change the sharding once set, right?

Have a look at the Shrink API.
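(Shrink only exists from 5.0 onward, so it would have to wait for the migration. Roughly, the source index is made read-only and fully allocated on one node, then shrunk into a new index with fewer shards; index and node names below are illustrative:)

PUT /my_index/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "node-1"
}

POST /my_index/_shrink/my_index_shrunk
{
  "settings": { "index.number_of_shards": 1 }
}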

The point is probably how we structure the indexing, in the sense that we allow our customers to create "folders" where they store "documents", and for each folder we have an index to search over the docs. But I can't figure out a solution that avoids a 1-to-1 mapping, since every doc can be of a different type.

So you don't have any control over the documents that are indexed by the end user, is that right?

Not directly; they can create documents as they like using various types (base types: int, float, date, string, and arrays of those). However, if we find a way to have a consistent conversion, we can create the documents in the indexes as we like. Still, we have to be able to execute queries and the like over the data afterward.

As for the Shrink API, we are on an old 2.4. We are going to migrate soon, but that's not an easy task, so we will probably recreate the indexes from scratch, and while doing so we were looking into whether there's a better way to set up the indexing compared to the current approach.
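(If the indexes are recreated on the same cluster, the Reindex API, available since 2.3, can copy documents from the old index into a newly created one with the desired settings and mappings; index names are illustrative:)

POST /_reindex
{
  "source": { "index": "old_index" },
  "dest":   { "index": "new_index" }
}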

This is interesting because we have a similar use case. We use ES as a denormalized graph database (using JSON-LD), and we have about 75 different object types. In ES 1.7 these were all in one big index, but when we migrated (currently at 5.6) we ran into an issue where we had normalized fields in one type that were strings (links) and denormalized fields that were objects (JSON-LD links followed and bloated into objects). So our solution was to put each object in a distinct index. Fast forward a couple of years and we are having performance/search issues, possibly due to having a large number of very small indexes and shards.

Quick update: we are designing a new way of indexing things, keeping a mapping between user-schema and elastic-schema. I still have some questions, if someone has answers.

We create a single index (or a single index per user) where we have 1-N fields for each type of data that can be stored.
In our system we maintain a mapping between user-schema and elastic-schema.
Such that we can have fewer indexes, maybe even 1 single big index.

schemaA:
Name: string
Surname: string
Age: int

schemaB
Id: int
Username: string
Date: datetime

The mapping is then

schemaA:
Name -> string_1
Surname -> string_2
Age -> int_1

schemaB
Id -> int_1
Username -> string_1
Date -> datetime_1

Thus, while the user-schemas are the ones above, in Elasticsearch we will have:

string_1
string_2
int_1
datetime_1
schema_id: used to map user schema
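(A sketch of what the shared mapping could look like; the index name and type name are illustrative, and the field types use 5.x+ syntax, on 2.4 the string fields would be of type "string" instead of "text"/"keyword":)

PUT /shared_docs
{
  "mappings": {
    "doc": {
      "properties": {
        "schema_id":  { "type": "keyword" },
        "string_1":   { "type": "text" },
        "string_2":   { "type": "text" },
        "int_1":      { "type": "integer" },
        "datetime_1": { "type": "date" }
      }
    }
  }
}

A query over schemaB's Id would then filter on schema_id plus a range on int_1, using the stored user-schema -> elastic-schema mapping to translate field names.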

the questions are:

  • is it better to have a single index or an index per user? in the first case we can have a huge number of fields, though I guess fewer than 100 (that being the max for each data type across all indexes). In the second, the fields will be far fewer but there will be more indexes. The index-per-user option is probably safer to manage, especially if we have to delete data
  • does adding/deleting/updating a document take a different amount of time when the index has a lot of docs or not? this matters because we can have 1 index or 1 index per user
  • adding a new field when someone has 3 strings to index (thus we need string_3) is free, correct? (see the sketch after this list)
  • is it a problem to have many documents that do not have all the fields filled in?
  • any problem that we can't foresee? or any idea?
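(By "adding a new field" I mean a put-mapping call like the sketch below, using the illustrative index/type names from above and 5.x path syntax:)

PUT /shared_docs/_mapping/doc
{
  "properties": {
    "string_3": { "type": "text" }
  }
}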

does anyone have advice for the post above this?
