Elasticsearch - compound index or discrete indexes?

Elasticsearch 6.3 (but our ES experience is all from earlier versions)

We have a typical multi-tenant cloud app with 5,000 accounts. Each account has 500 to 1,000,000 records/documents.

When we use a single monolithic Elasticsearch index, it is ~800GB in size and our queries are really slow: 5-14 seconds with the data on a hot SSD-based server. (We sort on different columns, and of course only ever want data for a single accountid.)

We know that an index holding a single account's data would give us super fast ES queries, but partitioning into 5,000 indexes seems over the top, and might require a rather large cluster so that no single server has to handle 1,000 indexes.

In my SQL thinking, I keep wanting to make the big monolithic index have a "compound index" aspect (like the accountid), since every query has an accountid prefix, but how best to do this in ES?

How do we find a path forward?

I'd use multiple shards + routing (on the account id).
If possible, I'd also create individual indices for "big" or VIP accounts.
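
To make the routing idea concrete, here is a minimal sketch of how index and search requests could carry the account id as the `routing` value. The node URL, the `records` index name and the helper names are assumptions for illustration only; the functions just build the REST requests and do not send them.

```python
import json
from urllib.parse import quote, urlencode

ES_URL = "http://localhost:9200"  # assumption: a local node
INDEX = "records"                 # hypothetical index name


def index_request(account_id, doc_id, doc):
    """Build (method, url, body) for indexing one document routed by account id."""
    params = urlencode({"routing": account_id})
    url = f"{ES_URL}/{INDEX}/_doc/{quote(str(doc_id), safe='')}?{params}"
    # Keep accountid in the document too, so queries can still filter on it.
    body = dict(doc, accountid=account_id)
    return "PUT", url, json.dumps(body)


def search_request(account_id, query_body):
    """Build (method, url, body) for a search that only visits the account's shard."""
    params = urlencode({"routing": account_id})
    url = f"{ES_URL}/{INDEX}/_search?{params}"
    return "POST", url, json.dumps(query_body)
```

The same routing value has to be supplied on every index, get, update, delete and search call for that account, otherwise the request can land on the wrong shard.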

That way you can also dedicate some powerful nodes to the VIP accounts and let the rest of the data go to the other nodes.

My 2 cents.

David,

Thank you.

If I follow your thinking (and the general thoughts behind routing), what is being said is:

"(a) Multi-shards + routing is like a lightweight approach to (b) discrete indexes. If you think about it, both a and b require the data partitioning logic to be held in the application. But it can be easier to manage and scale a single 'a' than to have thousands of discrete indexes. So build out a data partitioning layer in your app that can handle both a and b, and then tune to taste."

Warm?

My next question would be: Are there any libs out there that have implemented the app layer of this, or am I doing this myself? (No prob either way, just like to see what is being done already.)
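
For what it's worth, the app layer I have in mind could be sketched roughly like this. All names here (`records-shared`, `records-vip-`, `resolve_target`) are hypothetical, and the VIP list would come from configuration in practice.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

SHARED_INDEX = "records-shared"    # hypothetical shared multi-shard index
VIP_INDEX_PREFIX = "records-vip-"  # hypothetical prefix for per-account indices


@dataclass(frozen=True)
class Target:
    index: str
    routing: Optional[str]  # None means no custom routing is needed


def resolve_target(account_id: str, vip_accounts: FrozenSet[str]) -> Target:
    """Decide which index (and routing value) an account's requests should use."""
    if account_id in vip_accounts:
        # Option (b): a dedicated index per big/VIP account.
        # Index names must be lowercase in Elasticsearch.
        return Target(index=f"{VIP_INDEX_PREFIX}{account_id.lower()}", routing=None)
    # Option (a): the shared index, routed by account id.
    return Target(index=SHARED_INDEX, routing=account_id)
```

Every read and write would go through `resolve_target`, so moving an account between (a) and (b) becomes a reindex plus a config change rather than a code change.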

Thank you!

I don't think so.

David,

One other question.

How is shards + routing materially better than buckets?

I believe this is the same question as "Are contents of a bucket held in the same shard?"

David,

I am working to deeply understand my options.

Problem statement: With 5,000 accounts, my index is 800GB and queries sorting on multiple fields are very slow. In my case, I am always sorting by at least three fields: accountid, <something I am interested in>, document-primary-status.

I believe the main solution path is to partition the data by accountid, so that (a) a given query runs against a smaller index (or shard), and (b) the accountid aspect is already baked into the structure of the underlying data organization.

There are currently two paths on the table (thanks very much to your help):

  • Sharding + routing: I move the data partitioning logic into the application layer. I stand up an index with 100 or more shards, and route my insert/update/select/delete calls to the appropriate shard via routing.

  • Buckets: I create buckets, each holding either a single accountid or a set of accountids. I then manage at the bucket layer, and let ES map the buckets to shards.
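
To get a feel for the first option, here is a rough simulation of how 5,000 accounts would spread over 100 shards when the shard is picked as hash(routing) modulo the number of primary shards. Note the assumptions: `zlib.crc32` is only a stand-in (ES uses its own Murmur3-based hash, so real shard numbers will differ) and the `account-N` ids are made up.

```python
import zlib
from collections import Counter

NUM_SHARDS = 100     # shard count from the option above
NUM_ACCOUNTS = 5000  # account count from the problem statement


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard for a routing value (stand-in hash, not the one ES uses)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# Count how many (made-up) account ids land on each shard.
accounts_per_shard = Counter(
    shard_for(f"account-{i}") for i in range(NUM_ACCOUNTS)
)

if __name__ == "__main__":
    print("shards used:", len(accounts_per_shard))
    print("most accounts on one shard:", max(accounts_per_shard.values()))
```

Whatever the hash, each shard ends up holding many accounts (50 on average here), so a routed query still needs an accountid filter, and a shard that happens to receive several large accounts will be much bigger than the others.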

I am working to:

  • More deeply understand the benefits and issues of each approach

  • Confirm that these are the two paths I should examine (that there is not a viable third or fourth option)

Any and all input appreciated!

800GB for how many shards?

BTW why do you sort by accountid?
It does not make sense to me. Could you share a typical query?
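
To illustrate what I mean: when every hit belongs to the same account, sorting by accountid changes nothing, so it can be a filter instead. A sketch of such a request body, assuming `accountid` is a keyword field and using hypothetical names (`created_at`, `primary_status`) for the other two sort fields:

```python
def account_query(account_id, sort_field, size=50):
    """Build a search body that filters on one account and sorts on two fields."""
    return {
        "size": size,
        # Filter context: no scoring, and the filter can be cached.
        "query": {"bool": {"filter": [{"term": {"accountid": account_id}}]}},
        # accountid is constant across all hits, so it is left out of the sort.
        "sort": [
            {sort_field: {"order": "asc"}},
            {"primary_status": {"order": "asc"}},  # hypothetical field name
        ],
    }
```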
