Bulk indexing terabytes of time-based data

I am about to index many terabytes of time-based data using Amazon AWS, so
I want to get it right! I would appreciate advice.

My current plan is to stream the data into different indices, changing
index when it reaches a given maximum size (10GB?). Each index will have a
single shard, since I am effectively sharding by time.
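
As a sketch of what I mean, creating each single-shard index over the REST API would look something like this (Python with requests; the host and index naming are placeholders):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    # One primary shard, no replicas for now; time does the sharding.
    settings = {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}
    requests.put(f"{ES}/events-000001", json=settings).raise_for_status()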

I will use an S3 gateway to store cluster state. The issue is primary
storage.

The recommendation is to use local storage for primary storage so that
restarts happen quickly. If I do this, then I need to maintain a mapping
from index to volume. Then, to access a set of indices, I need to make
sure that the volume containing that index is mounted and an ES node is
running using that volume.

Allocating one index per volume is simple. The AWS volume would be small,
and therefore easily provisioned. (The larger the AWS volume, the longer it
takes AWS to provision the new volume). However, to bring a set of indices
online, I would need to mount each volume at a different mount point, and
configure an ES node to point to that mount point. Using one ES instance
per volume means that I could use the same logical mount point and config
file, but is expensive, since I would have to run a large number of ES
instances.

Allocating multiple indices per volume seems to be the better answer. I've
had trouble provisioning 1TB EBS volumes, so I'm thinking of setting the
volume size at 500GB. Then, I could put nearly 50 indices on each volume.

I think the best approach is to place indices that are close in time
together on the same volume. Then, I can eventually archive an entire
volume when the time interval represented by that volume is no longer
needed.

So, let's say I have a large number of documents (of roughly the same
size), roughly sorted in time. My indexing algorithm is as follows.

Do a test indexing run to determine the average number of index (on-disk)
bytes per document. Since my target index size is 10GB, I will index enough
documents to get to 5GB. The average will then be an overestimate, since we
assume that index size grows sub-linearly with the number of documents, so
indices sized with it should come in at or under the 10GB target.
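
A sketch of that calibration step (the index name and document count are hypothetical, and the stats-API path follows recent ES versions):

    import requests

    ES = "http://localhost:9200"            # hypothetical endpoint

    def index_size_bytes(index):
        # Primary on-disk store size via the index stats API
        # (path and JSON layout per recent ES versions).
        stats = requests.get(f"{ES}/{index}/_stats/store").json()
        return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]

    docs_indexed = 4_200_000                # hypothetical count from the 5GB test run
    bytes_per_doc = index_size_bytes("calibration") / docs_indexed
    docs_per_index = int(10 * 2**30 / bytes_per_doc)   # documents per 10GB index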

With this average, partition the document sequence into sub-sequences of
(500GB / (index bytes per document)) documents. Allocate a separate AWS
EBS volume for each sub-sequence. To index in parallel, provision one AWS
EC2 instance per EBS volume. Allocate one elasticsearch node with file
system storage and s3 gateway to index the sub-sequence. Process the
documents in order. When an index fills up, close the index and open a new
one.
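
A sketch of that per-volume indexing loop (the index naming is a placeholder; I use single-document calls for clarity, though the _bulk API would be much faster in practice, and recent ES versions use /_doc where older ones used a named type):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    def index_documents(docs, docs_per_index):
        # Stream time-ordered docs, rolling to a fresh single-shard index as each fills.
        seq, count, name = 0, 0, None
        for doc in docs:
            if name is None or count >= docs_per_index:
                if name is not None:
                    requests.post(f"{ES}/{name}/_close")   # the filled index goes cold
                seq += 1
                count = 0
                name = f"events-{seq:06d}"
                settings = {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}
                requests.put(f"{ES}/{name}", json=settings).raise_for_status()
            requests.post(f"{ES}/{name}/_doc", json=doc).raise_for_status()
            count += 1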

After indexing is done, I will have K volumes with approximately 50 indices
each, i.e. roughly 50 × K indices in total. To run a query over a given time
range, I determine which volumes need to be online and ensure that they are
online. If I need more query throughput, I add a number of other ES nodes to
the system. My understanding is that elasticsearch will migrate the indices
to these nodes to balance the number of shards per node. If adding one node
per index is not enough, I can increase the number of nodes and change the
number of replicas for each index, correct?
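
If so, I assume the replica change is just a live settings update, something like this (the index name is a placeholder):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    # Two replicas -> three copies of each shard, spreading query load.
    body = {"index": {"number_of_replicas": 2}}
    requests.put(f"{ES}/events-000042/_settings", json=body).raise_for_status()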

Question: If I want to permanently move an index to a different EBS volume,
what is the best way to do that? Say I have 50 indices on one volume, and I
want to move a certain subset of them to another volume. What do I do?

The simplest design to do what you're describing (indexing time-based data)
would be to use ES local storage, set the replica count to 1, and let ES
manage which index gets stored where. You can use multiple EC2 instances,
each server instance with multiple EBS volumes combined. You can create new
indices as you add data, creating a new index every time the index size gets
over 10GB, as you describe. When the load is too much for the nodes in the
cluster, you can add a new node, and ES will re-balance the indices among
the nodes automatically. You can close old indices so that they won't
consume resources, and only open them back up if needed. You can even move
the old indices to S3 if you don't expect them to be searched regularly and
can tolerate the time it would take for them to be copied back.
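
For example, closing and re-opening an index is a single REST call each way (a sketch; host and index name are placeholders):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    # A closed index keeps its data on disk but frees heap and file handles.
    requests.post(f"{ES}/events-000001/_close")
    # Re-open it later if it needs to be searched again.
    requests.post(f"{ES}/events-000001/_open")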

This is the road well traveled, so to speak, but maybe it's not suitable for
some requirements you have. A lot of what you're describing seems to be
trying to control where an index is located; can you explain the reasoning
for this? Also, what is the reasoning for using the S3 gateway rather than
local storage?

I think you should test how much data a single node can handle, and
extrapolate from there.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


In my application, I will be indexing a very large amount of data that
grows over time and I will need to archive old data over time. My quest to
determine which volume an index is located on was predicated on the
assumption that the only way to archive the data was to keep track of which
volume it is located on.

However, in my reading today, perhaps I have discovered the elegant ES
solution: when it is time to migrate data offline, use the index routing
feature to direct ES to move an index to a particular node:

index.routing.allocation.include.tag: value

Presumably, a short time later, the indices will be located on the target
node, and then I can decommission the node (and the volume). Is this what
you would suggest?
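
Concretely, I imagine the move looks something like the following sketch (it assumes the target node was started with a tag attribute, node.tag in older releases or node.attr.tag in recent ones; the index name is a placeholder):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    # Tell ES to keep this index's shards only on nodes tagged "archive";
    # applying the filter triggers shard relocation in the background.
    body = {"index.routing.allocation.include.tag": "archive"}
    requests.put(f"{ES}/events-000007/_settings", json=body).raise_for_status()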

I was considering an S3 gateway to completely abstract the need for local
EBS storage. However, that doesn't work because I can ill afford to keep
all indices in main memory. So, given that I need local EBS storage, I
don't need an S3 gateway anymore!

I definitely want to stay on the well-traveled road! Thanks for pointing
it out!!

Derrick, I think you keep missing core parts of how elasticsearch works. I
explained in another mail (actually, in a few responses) that you don't need
to allocate a specific volume for an index (see the other response); a node
has a data location, and that's where shards will live.

Regarding the gateway, the s3 gateway does not mean the indices will be in
memory. Whether you use the local gateway or the shared (s3) gateway, the
shards' data is stored on the nodes they are allocated on. With the s3
gateway, that data is also continuously copied over to S3, whereas with the
local gateway the cluster knows to recover itself from each node's local
data.


Yes, I know I appear to be dense, but I was trying to figure out how to
make all resources (CPU, I/O, STORAGE) completely elastic when deploying on
AWS, yet do so economically as my demand fluctuates. My questions were
probing at ways to do so.

Elasticsearch goes a LONG way to treating all resources as elastic. I
understand that to make CPU and I/O elastic, elasticsearch moves indices
from one node to another. This is very cool, but implicitly requires that
the destination node have enough STORAGE to receive indices. If each node
has an elastic STORAGE pool, then all resources are elastic.

However, I am deploying on AWS, which provides primary storage
using EBS or instance volumes. Neither is transparently elastic. I can
resize EBS volumes, but this is not transparent to me! I don't know of a
primary elastic storage option available on AWS. Is there one?

AWS does offer an elastic storage option, S3. So, I was hoping that
elasticsearch could use S3 as primary storage. However, elasticsearch uses
S3 as backing store (and cluster metadata store) only. So, my hopes for a
truly transparently elastic storage option on AWS appear to be dashed.

So, if primary storage is not elastic, then in what sizes do I allocate it?
This led me to think through the range of choices, from an EBS volume per ES
shard to a mega-volume storing a large number of shards. The former would
require me to have a large number of AWS EC2 instances, so I ruled that out.
The latter would allow me to bring all indices online with a small number of
ES nodes. I call these "anchor" nodes.

When I want to perform bulk indexing or large numbers of queries, I can
bring more ES nodes on line, each with their own local storage. When
demand subsides, I can (optionally) migrate the data from those nodes to
the "anchor" nodes.
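
I assume that migration amounts to a cluster-level allocation exclusion, roughly like this sketch (the IP is a placeholder, and one would wait for relocation to finish before shutting the node down):

    import requests

    ES = "http://localhost:9200"   # hypothetical endpoint

    # Exclude the departing node by IP; ES moves its shards elsewhere,
    # after which the node (and its volume) can be shut down safely.
    body = {"transient": {"cluster.routing.allocation.exclude._ip": "10.0.0.42"}}
    requests.put(f"{ES}/_cluster/settings", json=body).raise_for_status()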

Now, if you know of another way to make storage elastic on AWS, please tell
me! If I owned the infrastructure, I could use NFS or another shared storage
pool. I'd still have the problem of scaling I/O bandwidth to the shared
storage, however; that's a problem you no doubt have much experience with,
given your previous search design experience.