I am about to index many terabytes of time-based data using Amazon AWS, so
I want to get it right! I would appreciate advice.
My current plan is to stream the data into a sequence of indices, rolling
over to a new index when the current one reaches a given maximum size
(10GB?). Each index will have a single shard, since I am effectively
sharding by time.
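Roughly, this is what I have in mind for creating each index (a Python
sketch against the REST API; the host address and naming are placeholders,
not my actual pipeline):

    import requests

    ES = "http://localhost:9200"     # hypothetical node address
    MAX_INDEX_BYTES = 10 * 2**30     # roll over at ~10GB

    def create_index(name):
        # One shard and no replicas: the time-based split is effectively
        # the sharding, and replicas can be added later for query load.
        settings = {"settings": {"number_of_shards": 1,
                                 "number_of_replicas": 0}}
        requests.put("%s/%s" % (ES, name), json=settings).raise_for_status()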
I will use an S3 gateway to store cluster state. The issue is primary
storage.
The recommendation is to use local storage for primary storage so that
restarts happen quickly. If I do this, then I need to maintain a mapping
from index to volume. Then, to access a set of indices, I need to make
sure that the volume containing that index is mounted and an ES node is
running using that volume.
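For concreteness, I expect each node's elasticsearch.yml to look roughly
like this (bucket name, credentials, and path are placeholders; I believe
the s3 gateway takes its credentials from the cloud.aws settings):

    gateway:
        type: s3
        s3:
            bucket: my-cluster-bucket
    cloud:
        aws:
            access_key: <aws access key>
            secret_key: <aws secret key>
    path:
        data: /mnt/es-data    # the mounted EBS volume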
Allocating one index per volume is simple. Each EBS volume would be small,
and therefore easily provisioned. (The larger the EBS volume, the longer
AWS takes to provision it.) However, to bring a set of indices online, I
would need to mount each volume at a different mount point and configure an
ES node to point at each one. Using one ES instance per volume means that I
could reuse the same logical mount point and config file, but that is
expensive, since I would have to run a large number of ES instances.
Allocating multiple indices per volume seems to be the better answer. I've
had trouble provisioning 1TB EBS volumes, so I'm thinking of setting the
volume size at 500GB. Then, at 10GB per index, I could put nearly 50
indices on each volume. I think the best approach is to place indices that
are close in time on the same volume. Then, I can eventually archive an
entire volume when the time interval it covers is no longer needed.
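In other words, something like this (a sketch, assuming index names sorted
by time and 50 indices per 500GB volume):

    def assign_indices_to_volumes(index_names, per_volume=50):
        # index_names is sorted by time, so each volume covers one
        # contiguous time interval and can later be archived whole.
        return [index_names[i:i + per_volume]
                for i in range(0, len(index_names), per_volume)]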
So, let's say I have a large number of documents (of roughly the same
size), roughly sorted in time. My indexing algorithm is as follows.
Do a test indexing run to determine the average number of on-disk index
bytes per document. Since my target index size is 10GB, index enough
documents to reach 5GB and take the average there. That average should be
an overestimate, assuming index size grows sub-linearly with the number of
documents, so the true per-document cost at 10GB will be lower.
With this average, partition the document sequence into sub-sequences of
(500GB / (index bytes per document)) documents each. Allocate a separate
EBS volume for each sub-sequence. To index in parallel, provision one EC2
instance per EBS volume, and run one elasticsearch node (file system
storage, s3 gateway) on it to index that sub-sequence. Process the
documents in order; when an index fills up, close it and open a new one.
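As a sketch of the whole loop (index_doc and close_index stand in for the
actual indexing and close calls, create_index is as above, and
bytes_per_doc is the overestimate from the test run):

    TARGET_INDEX_BYTES = 10 * 2**30    # 10GB per index
    VOLUME_BYTES = 500 * 2**30         # 500GB per EBS volume

    def partition(docs, bytes_per_doc):
        # Split the time-ordered document sequence into one
        # sub-sequence per EBS volume.
        n = int(VOLUME_BYTES / bytes_per_doc)
        return [docs[i:i + n] for i in range(0, len(docs), n)]

    def index_subsequence(docs, bytes_per_doc, volume_id):
        # Runs on the EC2 instance attached to this sub-sequence's volume.
        per_index = int(TARGET_INDEX_BYTES / bytes_per_doc)
        for i in range(0, len(docs), per_index):
            # hypothetical naming scheme: volume number, then index number
            name = "idx-%04d-%04d" % (volume_id, i // per_index)
            create_index(name)
            for doc in docs[i:i + per_index]:
                index_doc(name, doc)
            close_index(name)    # index is full; move on to the next one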
After indexing is done, I will have K volumes with roughly 50·K indices in
total (50 per volume). To run a query over a given time range, I determine
which volumes need to be online and ensure that they are. If I need more
query throughput, I add a number of other ES nodes to the system. My
understanding is that elasticsearch will migrate indices to these nodes
to balance the number of shards per node. If adding one node per index is
not enough, I can increase the number of nodes and change the number of
replicas for each index, correct?
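For the replica change, I assume the update-settings API is the right
mechanism, along the lines of:

    import requests

    def set_replicas(es, index, n):
        # More replicas let the added nodes serve copies of the shard.
        body = {"index": {"number_of_replicas": n}}
        requests.put("%s/%s/_settings" % (es, index),
                     json=body).raise_for_status()

    set_replicas("http://localhost:9200", "idx-0001-0003", 2)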
Question: If I want to permanently move an index to a different EBS volume,
what is the best way to do that? Say I have 50 indices on one volume, and I
want to move a certain sub-set of them to another volume. What do I do?