We are currently evaluating Elasticsearch for our product's needs and have
run into some difficulty figuring out how to deploy and scale it within
our product. We currently have around 30,000 customers (companies), some
of which are small (5,000 documents) and some of which are large (2,000,000+
documents). As customers grow over time, they may move from being a small
or medium-sized customer to a large one.
We would like to index all of our customers' documents in Elasticsearch,
but we have had problems with each scenario we have considered. Here are the
proposals and the problems we have faced; any advice is appreciated.
First Proposal: One index for each customer.
Problems:
When we tested with 500 small indexes (each with the default 5
shards) on one server (-Xms4g, -Xmx6g), the server started extremely
slowly: it took 30 minutes to go from red to yellow status, and when we
tested with 1,000 indexes it took 60 minutes.
The other problem with this setup is RAM usage. The server grabbed around
1.6 GB of Java heap at start-up even with no load, and when we put search
load on the indexes the heap grew to 5.1 GB (and the GC didn't release the
RAM after we stopped the load).
With this setup we would be able to manage and remove customers very easily,
and we would prefer to set up our cluster with this model if we can find a
solution for these problems, but our initial tests really disappointed us.
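Since most of that startup and heap cost scales with shard count, one mitigation for the per-customer model is to create each small customer's index with a single shard instead of the default five. A minimal sketch of the settings body for the create-index call (the index name and defaults here are invented for illustration, not taken from the thread):

```python
import json

def small_customer_index_settings(shards=1, replicas=1):
    # One shard instead of the default five, so 1,000 small customer
    # indexes cost the cluster 1,000 primary shards rather than 5,000.
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }

# This body would accompany e.g. PUT /customer_123
print(json.dumps(small_customer_index_settings()))
```

Large customers could still get more shards at creation time, since the shard count is fixed per index.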
Second Proposal: One index for large customers and using a few large
indexes for all the smaller customers.
Problems:
Migration from a jointly used index into a single large index would be
difficult. (we would likely need to do this if a customer got to big in
order to improve the response times) and re-indexing documents would be
quite difficult and slow for large data sets.
Deleting customers would be more difficult.
With the first solution we could easily remove an index folder when a
customer is deleted, but with a multi-tenant solution we would need to
delete their documents from a shared index (We have no idea how heavy
delete operation would be and how it would effect the optimization process)
Any advice you can give to help us find a practical solution is greatly
appreciated.
Second Proposal: One index for large customers and using a few large
indexes for all the smaller customers.
Problems:
Migration from a jointly used index into a single large index would be
difficult (we would likely need to do this if a customer got too big, in
order to improve response times), and re-indexing documents would be
quite difficult and slow for large data sets.
First, it wouldn't be that slow - it depends on how much data and what
hardware you have, etc. Second, you can do it in the background, then
switch the customer alias from the shared index to the dedicated index in
one atomic step.
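The atomic switch Clint describes uses the `_aliases` endpoint, which applies its whole list of actions as one operation. A sketch of the request body (the alias and index names are invented for illustration):

```python
import json

def alias_swap_body(alias, old_index, new_index):
    # Body for POST /_aliases: the remove and the add are applied
    # together, so searches on the alias never hit a window where it
    # points at neither index.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_body("customer_123", "shared_small", "customer_123_v1")
print(json.dumps(body))
```

For this to work, each customer would search through its alias from day one, so the move from a shared index to a dedicated one needs no client-side changes.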
Deleting customers would be more difficult.
Just use a delete-by-query. It's not as efficient as dropping an index,
but it will work fine.
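For reference, the delete-by-query just needs a query that isolates one tenant's documents. A sketch assuming each document carries a `customer_id` field (the field name is an assumption about the mapping, not something stated in the thread):

```python
import json

def delete_customer_query(customer_id):
    # Query body for DELETE /shared_small/_query: matches exactly the
    # documents belonging to the customer being removed.
    return {"term": {"customer_id": customer_id}}

print(json.dumps(delete_customer_query("customer_123")))
```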
Hi Clint... so are you saying that the second proposal is the better
solution and we shouldn't really consider the first? Do you think the
second solution is the way to go? I am a bit concerned about re-indexing
1,000,000+ documents during a "move", especially given the increased load
on the other resources while the data is read and re-sent to ES for
indexing. I am pretty sure we want to avoid re-indexing if possible.
Do you think one index per company is a bad idea, given what you know about
ES?
Drew
On Tuesday, May 21, 2013 6:46:05 AM UTC-4, Clinton Gormley wrote the reply above.
Do you think one index per company is a bad idea, given what you
know about ES?
Neither of Reza's approaches is wrong. Each has trade-offs.
Single-tenant indices have a lot of advantages, you just have to
do a little more work client-side to make them scale well. Here's
one way to do it.
Personally, I'd go for the second option. Reindexing a million docs should
be done in way less than an hour (depending of course on docs, hardware
etc). And you can control your indexing speed so that you don't overwhelm
your resources.
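One way to "control your indexing speed" is simply to batch the documents into fixed-size bulk requests and pause between them. A client-side sketch (the batch size, pause, and `send_bulk` callback are illustrative placeholders, not an Elasticsearch API):

```python
import itertools
import time

def batches(docs, size):
    """Yield lists of at most `size` docs, one list per _bulk request."""
    it = iter(docs)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

def throttled_reindex(docs, send_bulk, size=500, pause=0.5):
    """Send docs in small bulk requests, sleeping between them so the
    source system and the cluster are never hit all at once."""
    total = 0
    for batch in batches(docs, size):
        send_bulk(batch)  # placeholder for the real _bulk call
        total += len(batch)
        time.sleep(pause)
    return total

sent = []
n = throttled_reindex(range(1200), sent.append, size=500, pause=0)
print(n, [len(b) for b in sent])  # 1200 [500, 500, 200]
```

Tuning `size` and `pause` trades reindexing time against load, which is the control Clint is pointing at.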
I was wondering whether it would be reasonable for Elasticsearch to support
some mechanism for controlling the number of open indexes. For example, an
LRU model could close the least recently used index once the total number
of open indexes exceeds a fixed limit.
Reza
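Elasticsearch doesn't do this automatically, but the `_close` and `_open` index APIs do exist, so the LRU policy could live client-side. A sketch of the bookkeeping (the cap and index names are illustrative; the caller would issue the actual close/open requests):

```python
from collections import OrderedDict

class OpenIndexLRU:
    """Track open indexes, least recently used first. When the cap is
    exceeded, report which index to close; the caller would then
    POST /{index}/_close, and /_open it again on a later touch."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.open = OrderedDict()

    def touch(self, index):
        """Record a use of `index`; return an index name to close,
        or None if we are still under the cap."""
        if index in self.open:
            self.open.move_to_end(index)
            return None
        self.open[index] = True
        if len(self.open) > self.max_open:
            victim, _ = self.open.popitem(last=False)
            return victim
        return None

lru = OpenIndexLRU(max_open=2)
lru.touch("customer_a")
lru.touch("customer_b")
lru.touch("customer_a")          # refreshes "customer_a"
print(lru.touch("customer_c"))   # over the cap -> customer_b
```

The caveat is that closed indexes reject searches until reopened, so this only fits tenants with very bursty access.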
On Thursday, May 23, 2013 1:20:43 PM UTC+3:30, Clinton Gormley wrote the reply above, in response to Drew Raines's message of 21 May 2013 15:43 answering Drew Morris.