A minimal setup for an ever-growing data set


(Rotem Hermon) #1

Hi,

First, kudos for this great product!

I'm working on incorporating an ES cluster for indexing our data. We're a
young startup with limited resources so I'm trying to figure out a minimal
setup that will hold.

Our data: we have around 15M documents per month, each around 1-2 KB (a
short text with several additional metadata fields, some in nested
objects). So it's around 25GB of data per month. Data keeps flowing
constantly at a rate of several docs per second. Traffic is still low, so no
more than a few queries per second.
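
As a rough sanity check on the figures above, the monthly volume can be estimated from the document count and the average document size. This is only a back-of-the-envelope sketch: it counts raw document bytes (the on-disk index size will differ), and the `copies` parameter is a hypothetical knob for counting replica copies.

```python
DOCS_PER_MONTH = 15_000_000  # figure quoted in the post


def monthly_gb(avg_doc_kb, copies=1):
    """Raw data volume per month in decimal GB (1 GB = 1,000,000 KB).

    copies=2 would count one primary plus one replica.
    """
    return DOCS_PER_MONTH * avg_doc_kb * copies / 1_000_000


if __name__ == "__main__":
    for kb in (1, 1.5, 2):
        print(f"{kb} KB/doc: {monthly_gb(kb):.1f} GB raw, "
              f"{monthly_gb(kb, copies=2):.1f} GB with one replica")
```

The 1-2 KB range gives 15 to 30 GB of raw data per month, which is consistent with the "around 25GB" estimate.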

According to recommendations I found in the forum it seems like the setup
should be:
- Create an index per month and alias them all under one name, with a filter
  by date, so I can query according to the date range I need.
- Give each index 1 shard and 1 replica.

We're planning to start with a 2 node setup (which means all shards on a
single node and one for backup).
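
The setup described above could be sketched roughly as follows. The code only builds the JSON request bodies and sends nothing. The `nov-2011` naming scheme and the `top-index` alias are taken from later in this thread, the helper names are made up for illustration, and the endpoints in the comments (`PUT /<index>` and `POST /_aliases`) should be checked against the Elasticsearch version in use.

```python
import json
from datetime import date

MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_index_name(day):
    """Map a date to a per-month index name, e.g. 2011-11-23 -> 'nov-2011'."""
    return f"{MONTH_ABBR[day.month - 1]}-{day.year}"


def create_index_body():
    """Body for PUT /<index>: one shard, one replica per monthly index."""
    return {"settings": {"number_of_shards": 1, "number_of_replicas": 1}}


def add_alias_body(index, alias="top-index"):
    """Body for POST /_aliases: attach a monthly index to the shared alias."""
    return {"actions": [{"add": {"index": index, "alias": alias}}]}


if __name__ == "__main__":
    name = monthly_index_name(date(2011, 11, 23))
    print("PUT /" + name, json.dumps(create_index_body()))
    print("POST /_aliases", json.dumps(add_alias_body(name)))
```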

My only concern is memory. How are indexes handled in this
scenario? Will old indexes that are not queried a lot still impact the
memory needed? And if they are queried at some point (when querying old
data), can it cause out-of-memory errors like those I see reported? How
much memory should a node have for this kind of data and index setup?

Any pointers will be much appreciated.


(Rotem Hermon) #2

A continuation of this question: after playing a little with aliases and
filters, I see they're not exactly what I thought. I'm creating an index per
month and aliasing all the indexes to the same name so that I can search on
all of them.

I thought that creating a filter with the alias would automatically
route the search to the right index according to the filtered value. For
example:
"add" : {
    "index" : "nov-2011",
    "alias" : "top-index",
    "filter" : {
        "range" : {
            "pt.$date" : {
                "from" : "2011-11-01",
                "to" : "2011-11-30"
            }
        }
    }
}

But it doesn't seem to work that way. So I guess I misunderstood how filter
works with aliases?

So the question is what is the benefit of creating an index by month, if
the query still needs to hit all of them?


(Mike Kelp) #3

I believe the benefit is that an alias can easily be changed without having
to do a merge on the documents to remove them, etc. As you remove a month
from the alias later, you can then delete that whole index if you wish, so
purging documents from one large index does not become a problem.

Basically, the alias allows you to better manage what data is included in
your search.

I'm pretty new to this as well, so please take my answer with a reasonably
sized rock of salt.

Mike.


(Shay Banon) #4

First, the filter associated with an alias is simply there to filter the
results, nothing more. You will need to choose the indices to query yourself
if you decide to create an index per month.
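
Choosing the indices on the client side could look something like the sketch below. It assumes the `nov-2011` naming scheme from the example earlier in the thread and relies on the fact that several index names can be given, comma-separated, in the search path. The helper names are hypothetical and nothing is sent over the network.

```python
from datetime import date
from typing import List

MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def indices_for_range(start: date, end: date) -> List[str]:
    """List the monthly index names covering start..end (inclusive)."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{MONTH_ABBR[month - 1]}-{year}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names


def search_path(start: date, end: date) -> str:
    """Build a multi-index search path such as /nov-2011,dec-2011/_search."""
    return "/" + ",".join(indices_for_range(start, end)) + "/_search"


if __name__ == "__main__":
    print(search_path(date(2011, 11, 15), date(2012, 1, 3)))
```

A date range filter inside the query is still needed when the requested range does not line up with month boundaries.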

Regarding memory usage, "old" indices still retain the memory used
to support them. You can close an index (but then it will not be
searchable/indexable).
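
As a rough illustration of closing old indices, the sketch below works out which monthly indices fall outside a retention window and builds the corresponding `POST /<index>/_close` paths. The `nov-2011` naming scheme comes from the example in this thread. The `keep_months` policy and the helper names are assumptions made for illustration, and the requests themselves are not sent.

```python
from datetime import date
from typing import List, Tuple

MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def parse_index_name(name: str) -> Tuple[int, int]:
    """Turn 'nov-2011' into (2011, 11)."""
    abbr, year = name.split("-")
    return int(year), MONTH_ABBR.index(abbr) + 1


def close_paths(index_names: List[str], today: date, keep_months: int) -> List[str]:
    """Return POST paths that would close indices older than the window.

    keep_months=2 keeps the current month and the previous one open.
    """
    # Count months on a single axis so year boundaries need no special case.
    oldest_kept = today.year * 12 + (today.month - 1) - (keep_months - 1)
    paths = []
    for name in index_names:
        year, month = parse_index_name(name)
        if year * 12 + (month - 1) < oldest_kept:
            paths.append(f"/{name}/_close")
    return paths


if __name__ == "__main__":
    existing = ["sep-2011", "oct-2011", "nov-2011"]
    print(close_paths(existing, date(2011, 11, 23), keep_months=2))
```

A closed index can be reopened later with `POST /<index>/_open` if old data needs to be queried again.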

On Wed, Nov 23, 2011 at 5:08 PM, Rotem rotem.hermon@gmail.com wrote:



(Rotem Hermon) #5

OK. Any pointers as to what is a reasonable memory requirement for this
kind of data size?


(Shay Banon) #6

Nothing aside from doing capacity tests. Obviously the data size has an
effect, but it depends on many factors, such as the number and size of terms
and whether you sort or facet on fields. The node stats API is your friend
here, giving a lot of info on memory used.
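
A minimal sketch of reading heap usage out of a node stats response during a capacity test. It assumes the response has already been fetched and decoded from JSON (recent Elasticsearch versions expose it at `GET /_nodes/stats`; the path and field names have changed between releases, so check them against your version). The sample payload below is invented for illustration and is not real output.

```python
def heap_usage_percent(node_stats):
    """Map each node name to its JVM heap usage as a percentage.

    Assumes the fields nodes.<id>.jvm.mem.heap_used_in_bytes and
    heap_max_in_bytes are present in the response.
    """
    usage = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        mem = node["jvm"]["mem"]
        percent = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        # Fall back to the node id if the node reports no name.
        usage[node.get("name", node_id)] = round(percent, 1)
    return usage


if __name__ == "__main__":
    # Invented sample in the assumed response shape.
    sample = {
        "nodes": {
            "abc123": {
                "name": "node-1",
                "jvm": {"mem": {"heap_used_in_bytes": 250_000_000,
                                "heap_max_in_bytes": 1_000_000_000}},
            }
        }
    }
    print(heap_usage_percent(sample))
```

Watching this figure while indexing and while querying older months gives a concrete basis for sizing node memory.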

On Sun, Nov 27, 2011 at 11:35 PM, Rotem rotem.hermon@gmail.com wrote:


