Elasticsearch - compound index or discrete indexes?

Jose_Ramierez · July 20, 2018, 11:06pm

Elasticsearch 6.3 (but our ES experience is all from earlier versions)

We have a typical multi-tenant cloud app with 5,000 accounts. Each account has 500 to 1,000,000 records/documents.

When we use a single monolithic Elasticsearch index, it is ~800GB in size, and our queries are slow to run -- really slow: 5-14 seconds, even with the data on a hot, SSD-based server. (We are doing sorts on different columns, and of course only ever want data for a single accountid.)

We know that the data for a single account would give us super fast ES queries, but partitioning into 5,000 indexes seems over the top, and might require a rather large cluster so that no single server has to handle 1,000 indexes.

In my SQL thinking, I keep wanting to make the big monolithic index have a "compound index" aspect (like the accountid), since every query has an accountid prefix, but how best to do this in ES?

Do bucket aggregations solve for this? (On paper they appear to)
-- Will ES always keep a top level bucket in a single shard?
Is multi-shard + routing the solution?
-- https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
-- And then run 100 or 250 shards across 5 or more servers
Or ____?

How do we find a path forward?

dadoonet · July 21, 2018, 1:00am

I'd use multi shards + routing (on account Id)
I'd also, if possible, create individual indices for "big" or VIP accounts.

That way you can also dedicate some powerful nodes to the VIP accounts and let the other data go to the other nodes.

My 2 cents.
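
For anyone following along, here is a minimal sketch of what routing on the account ID could look like at the request level. It only builds the method, path, and body, so it runs without a cluster. The index name `records`, the field names, and the helper names are assumptions for illustration, not anything from the original post; the `routing` query parameter is the one described in the `_routing` mapping docs linked above.

```python
import json
from urllib.parse import quote

INDEX = "records"  # hypothetical index name


def index_request(account_id, doc_id, doc):
    """Build (method, path, body) to index one document routed by account ID."""
    path = "/{}/_doc/{}?routing={}".format(
        INDEX, quote(str(doc_id), safe=""), quote(str(account_id), safe="")
    )
    # Store the account ID in the document too, so queries can filter on it.
    body = dict(doc, accountid=account_id)
    return "PUT", path, json.dumps(body)


def search_request(account_id, sort_field, size=20):
    """Build (method, path, body) for a search limited to one account's shard."""
    path = "/{}/_search?routing={}".format(INDEX, quote(str(account_id), safe=""))
    body = {
        "size": size,
        # Filter (not sort) on accountid: every hit belongs to one account anyway.
        "query": {"bool": {"filter": [{"term": {"accountid": account_id}}]}},
        "sort": [{sort_field: "asc"}],
    }
    return "GET", path, json.dumps(body)


if __name__ == "__main__":
    print(index_request("acct-42", "doc-1", {"status": "open"}))
    print(search_request("acct-42", "status"))
```

The `term` filter stays in the query because several accounts can hash to the same shard; routing narrows the search to one shard, not to one account.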

Jose_Ramierez · July 22, 2018, 5:11pm

David,

Thank you.

If I follow your thinking (and the general thoughts behind routing), what is being said is:

"(a) Multi-shards + routing is like a lightweight approach to (b) discrete indexes. If you think about it, both a and b require data partitioning logic to be held in the application. But it can be easier to manage and scale a single 'a' than to have thousands of discrete indexes. So build out a data partitioning layer in your app that can handle both a and b, and then tune to taste."

Warm?

My next question would be: Are there any libs out there that have implemented the app layer of this, or am I doing this myself? (No prob either way, just like to see what is being done already.)

Thank you!
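
A rough sketch of the kind of partitioning layer described above: a small resolver that sends most accounts to a shared, routed index and a few large accounts to their own index. All names here (`records-shared`, the `DEDICATED` table, `resolve`, `search_path`) are hypothetical; treat it as a starting point rather than an existing library.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote


class Target(NamedTuple):
    index: str              # index the request should go to
    routing: Optional[str]  # routing value, or None for a dedicated index


SHARED_INDEX = "records-shared"  # hypothetical shared, multi-shard index

# Hypothetical lookup of "big" or VIP accounts that get their own index.
DEDICATED = {"acct-big-1": "records-acct-big-1"}


def resolve(account_id: str) -> Target:
    """Decide where an account's documents live."""
    dedicated_index = DEDICATED.get(account_id)
    if dedicated_index is not None:
        # Path (b): discrete index, no routing needed.
        return Target(dedicated_index, None)
    # Path (a): shared index, routed by account ID.
    return Target(SHARED_INDEX, account_id)


def search_path(account_id: str) -> str:
    """Build the _search path for an account, adding routing only when needed."""
    target = resolve(account_id)
    path = "/{}/_search".format(target.index)
    if target.routing is not None:
        path += "?routing=" + quote(target.routing, safe="")
    return path


if __name__ == "__main__":
    print(search_path("acct-17"))
    print(search_path("acct-big-1"))
```

Keeping the decision in one function means an account can be moved from the shared index to a dedicated one by reindexing its documents and adding one entry to the lookup, without touching query code.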

dadoonet · July 22, 2018, 6:27pm

I don't think so.

Jose_Ramierez · July 23, 2018, 12:58am

David,

One other question.

How is shards + routing materially better than buckets?

dadoonet · July 23, 2018, 1:25am

I believe this is the same question as "Are contents of a bucket held in the same shard?"

Jose_Ramierez · July 23, 2018, 5:50am

David,

I am working to deeply understand my options.

Problem statement: With 5,000 accounts, my index is 800GB and queries sorting on multiple fields are very slow. In my case, I am always sorting by at least three fields: accountid, <something I am interested in>, document-primary-status

I believe the main solution paths are to partition the data by accountid, so that a given query is: (a) running against a smaller index (or shard), and (b) the accountid aspect is already baked into the structure of the underlying data organization

There are currently two paths on the table (thanks very much to your help):

Sharding + routing: I move data partition logic into the application layer. I stand up an index with 100 or more shards, and route my insert/update/select/delete calls to the appropriate shard via routing.
Buckets: I create buckets to hold either a single accountid or a set of accountids. I then manage at the bucket layer, and let ES map the buckets to shards.

I am working to:

More deeply understand the benefits and issues of each approach
Confirm that these are the two paths I should examine (that there is not a viable third or fourth option)

Any and all input appreciated!
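
One way to see the difference between the two paths: routing decides where a document is stored, while a bucket aggregation groups whatever documents a query already matched, at search time, and as far as I know has no say in shard placement. The 6.x `_routing` docs give the placement rule as `shard_num = hash(_routing) % num_primary_shards`. The sketch below mimics that rule with `zlib.crc32` as a stand-in hash (Elasticsearch uses its own Murmur3-based hash, so real shard numbers will differ); the account IDs are made up.

```python
import zlib
from collections import Counter


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Mimic shard_num = hash(_routing) % num_primary_shards.

    crc32 is a stand-in hash for illustration only, NOT the hash
    Elasticsearch actually uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 100  # one of the shard counts floated in the thread
accounts = ["acct-{}".format(i) for i in range(5000)]  # made-up account IDs

# How many accounts land on each shard when routing by account ID.
placement = Counter(shard_for(account, NUM_SHARDS) for account in accounts)

if __name__ == "__main__":
    # The same routing value always maps to the same shard.
    print("acct-42 ->", shard_for("acct-42", NUM_SHARDS))
    print("shards used:", len(placement))
    print("accounts per shard: min", min(placement.values()),
          "max", max(placement.values()))
```

Note that the rule balances accounts per shard, not documents per shard: an account with 1,000,000 documents still lands entirely on one shard, which is one argument for giving the largest accounts their own indices.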

dadoonet · July 23, 2018, 7:11am

800gb for how many shards?

BTW why do you sort by accountid?
It does not make sense to me. Could you share a typical query?
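
The shard count matters because a routed query only touches one shard, so the per-shard size is roughly what each query has to work against. A back-of-the-envelope sketch, assuming documents are spread evenly (which the 500 to 1,000,000 range per account makes unlikely in practice). The shard counts 100 and 250 come from the original post; 1 and 5 are added for comparison (5 being the default number of primary shards in 6.x); the function name is made up.

```python
def avg_shard_size_gb(index_size_gb: float, primary_shards: int) -> float:
    """Average primary shard size, assuming a perfectly even spread."""
    return index_size_gb / primary_shards


INDEX_SIZE_GB = 800  # from the original post

if __name__ == "__main__":
    for shards in (1, 5, 100, 250):
        size = avg_shard_size_gb(INDEX_SIZE_GB, shards)
        print("{:>3} primary shards -> ~{:.1f} GB per shard".format(shards, size))
    # Averages: 800.0, 160.0, 8.0 and 3.2 GB respectively.
```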

system · August 20, 2018, 7:11am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
