Migration from AWS CloudSearch


(Chris Pappas) #1

My company is currently working on a migration from Amazon CloudSearch (ACS) to Elastic.

Each ACS domain is basically a single index, unlike ES, which as I've learned can support a virtually unlimited number of distinct indices (each with its own fields, etc.). Please correct me if I'm wrong.

Currently we have many ACS domains, which are expensive relative to the usage they receive (on the order of < 10k requests per day). I would like to combine a few (or all) of those domains into one Elastic Cloud instance to save $$ and optimize usage... no sense over-provisioning these for the low usage they receive.

One of my developers is telling me this is bad practice and that separate indexes mean more overhead and slower speeds. Is this accurate?

Assuming the underlying Elastic Cloud cluster is provisioned well, would my approach of mapping each unique ACS domain to its own ES index work?
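
To make the one-index-per-domain idea concrete, here is a minimal sketch of how each ACS domain's field definitions might be turned into a create-index request body. The domain names, field names, type table, and the shard/replica counts are all assumptions for illustration, not taken from a real setup; the type choices (for example `literal` to `keyword`) are one reasonable guess and would need checking against the actual data.

```python
# Assumed translation table from CloudSearch field types to Elasticsearch
# field types. This is an illustrative guess, not an official mapping.
ACS_TO_ES_TYPE = {
    "text": "text",
    "literal": "keyword",
    "int": "long",
    "double": "double",
    "date": "date",
    "latlon": "geo_point",
}


def es_field_type(acs_type):
    """Translate one CloudSearch field type into an Elasticsearch type.

    Elasticsearch fields can hold multiple values without a separate array
    type, so "text-array", "literal-array", etc. map to the scalar type.
    """
    suffix = "-array"
    base = acs_type[: -len(suffix)] if acs_type.endswith(suffix) else acs_type
    return ACS_TO_ES_TYPE[base]


def index_body(acs_fields):
    """Build a create-index request body for one former ACS domain."""
    return {
        # One primary shard and one replica is an assumption suited to
        # small, low-traffic indices; adjust after measuring.
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                name: {"type": es_field_type(acs_type)}
                for name, acs_type in acs_fields.items()
            }
        },
    }


# Hypothetical domains and fields, one index per domain.
domains = {
    "products": {"title": "text", "sku": "literal", "price": "double", "tags": "literal-array"},
    "stores": {"name": "text", "location": "latlon", "opened": "date"},
}

index_requests = {name: index_body(fields) for name, fields in domains.items()}
```

Each entry in `index_requests` could then be sent as the body of a `PUT /<index-name>` request, giving every former domain its own index with its own mappings.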


(Shane Connelly) #2

You're both kind of right!

Separate indices mean more shards, and indices, shards, and segments all carry a few different types of overhead. If you read https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster, most of your questions should be answered. We watch out for "shard explosion" with customers, which can cause a variety of problems outlined in that blog, but there we're usually talking tens of thousands of shards, whereas it sounds like you're potentially talking dozens to hundreds of shards, if I'm reading you right.
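
For a rough sense of scale, the total shard count is just primaries times (1 + replicas), summed over all indices. The sketch below assumes 15 indices with one primary and one replica each; those numbers are purely illustrative.

```python
def total_shards(indices):
    """Sum the shards over all indices.

    `indices` is a list of (primary_shards, replicas) pairs; every primary
    shard has `replicas` additional copies.
    """
    return sum(primaries * (1 + replicas) for primaries, replicas in indices)


# Assumed layout: 15 small indices, 1 primary shard and 1 replica each.
small_cluster = [(1, 1)] * 15
print(total_shards(small_cluster))  # prints 30
```

That is several orders of magnitude below the tens of thousands of shards where "shard explosion" becomes a concern.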

Practically, if you're at a few thousand requests per day, you have only a few indices in total, and your data is optimized with correct mappings so that you don't need crazy script queries with enormous aggregations, etc., then in all likelihood you'll be fine. If you really want to dive in, I'd encourage you to benchmark different setups with real-world queries and data using Rally.

One thing you should know about a multi-tenant environment in a single cluster is that there isn't full resource isolation. That can only really come from setting up multiple clusters. Sometimes we see people hoping to run 100s of tenants on the same cluster and have all of them use resources completely evenly, which is not how it works. If a really bad tenant has complete control of the data they index or the queries they write, they could insert enormous blocks of text and then run crazy regular expression queries with nasty scripts and massive aggregations against them. Elasticsearch has a lot of protections against this type of thing and we're adding more, but you may want to consider a layer that issues queries on the tenants' behalf to prevent abuse if this is the architecture you're going for.
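
As a rough idea of what such a layer could check before forwarding a search body, here is a minimal sketch. The blocked keys, the limits, and the names `check_search_body` and `RejectedQuery` are all hypothetical; a real gatekeeper would more likely build queries from a fixed set of templates rather than filter arbitrary ones.

```python
# Assumed policy: keys that are refused anywhere in the request body.
# This is deliberately conservative and would also reject a document
# field that happens to be named "script" or "regexp".
BLOCKED_KEYS = {"regexp", "script", "script_score", "script_fields"}
MAX_SIZE = 100        # assumed cap on hits returned per request
MAX_AGG_DEPTH = 2     # assumed cap on nested aggregation levels


class RejectedQuery(ValueError):
    """Raised when a search body violates the assumed policy."""


def _walk(node, agg_depth=0):
    """Recursively inspect dicts and lists in the search body."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in BLOCKED_KEYS:
                raise RejectedQuery("blocked key: " + key)
            depth = agg_depth + 1 if key in ("aggs", "aggregations") else agg_depth
            if depth > MAX_AGG_DEPTH:
                raise RejectedQuery("aggregations nested too deeply")
            _walk(value, depth)
    elif isinstance(node, list):
        for item in node:
            _walk(item, agg_depth)


def check_search_body(body):
    """Return the body unchanged if it passes, else raise RejectedQuery."""
    if body.get("size", 10) > MAX_SIZE:
        raise RejectedQuery("size exceeds limit")
    _walk(body)
    return body


# A simple match query passes; a regexp query is refused.
check_search_body({"query": {"match": {"title": "shoes"}}, "size": 20})
try:
    check_search_body({"query": {"regexp": {"title": ".*a.*b.*c.*"}}})
except RejectedQuery as err:
    print("rejected:", err)
```

A filter like this does not give real resource isolation; it only narrows what a tenant can ask the shared cluster to do.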


(Chris Pappas) #3

Thank you for the reply, I'm glad I got a response from someone "official" with Elastic Co.

I should have clarified what I meant by "many": we currently have about 10 ACS domains in total, each of which is totally under our company's control. Our "tenants" are simply services or other things that consume from or write to the ACS domains. Many of them are being treated almost like a "serverless" data API/store (a real anti-pattern that, as a relatively new hire, I'm working to fix).

It sounds like we would likely be within the safe range. We would never run 100s of "tenants"; I think at most we might stay in the same 10-15 range.

Cheers!

