As the subject says, I'm wondering about index size vs. number of indexes.
I'm indexing many application log files, currently with an index by day for
all logs, which will make a very large index. For just a few applications
in Development, the index is 55GB a day (across 2 servers). In prod with
all applications, it will be "much more than that". 1TB a day maybe?
I'm wondering if there is value in splitting the indexes by day and by
application, which would produce more indexes per day, but they would be
smaller, vs. value in having a single, mammoth index by day alone.
Is it just a resource question? If I have enough RAM/disk/CPU to support a
"mammoth" index, then I'm fine? Or are there other reasons to (or to not)
split up indexes?
Usually the problem is not so much the number of indices as the number of shards, which
are the physical units of data storage (an index being a logical view over
several shards).
Something to be aware of is that shards have some constant overhead
(disk space, file descriptors, memory usage) that does not depend on the
amount of data they store. While it is fine to have up to a few
tens of shards per node, you should avoid having, e.g., thousands of shards
per node.
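To see how quickly shard counts add up with daily, per-application indices, here is a rough back-of-the-envelope sketch; the application count, shard settings, and retention period below are hypothetical numbers for illustration, not figures from this thread:

```python
# Rough shard-count arithmetic for daily, per-application indices.
# All inputs are hypothetical examples, not recommendations.
apps = 50            # applications, one index each per day (assumed)
shards_per_index = 2 # primary shards per index (assumed)
replicas = 1         # replica copies of each primary (assumed)
retention_days = 30  # how long daily indices are kept (assumed)
nodes = 2            # data nodes in the cluster (assumed)

# Each index contributes primaries plus replicas; multiply by days retained.
total_shards = apps * shards_per_index * (1 + replicas) * retention_days
per_node = total_shards / nodes
print(total_shards, per_node)  # 6000 total, 3000 per node
```

With numbers like these you are already in the "thousands of shards per node" territory warned about above, which is why splitting by application usually goes hand in hand with conservative shard counts or longer retention/rollover periods.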
If you plan on always filtering on a specific application in your
search requests, then splitting by application makes sense, since it
makes the filter unnecessary at search time: you just query the
application-specific index. On the other hand, if you don't filter by
application, then splitting the data into smaller indices yourself is
pretty much equivalent to storing everything in a single index with a higher
number of shards.
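As a concrete sketch of the two layouts (the index names, the `<prefix>-<date>` naming scheme, and the application name "billing" are all made up for illustration):

```python
from datetime import date

day = date(2016, 1, 15).strftime("%Y.%m.%d")

# Layout 1: one big daily index; the application is a document field
# that every search request must filter on.
single_index = f"logs-{day}"
query_single = {"query": {"bool": {"filter": [{"term": {"app": "billing"}}]}}}

# Layout 2: one daily index per application; the "filter" becomes
# index selection, so the query body needs no app filter at all.
per_app_index = f"logs-billing-{day}"
query_per_app = {"query": {"match_all": {}}}

print(single_index)   # logs-2016.01.15
print(per_app_index)  # logs-billing-2016.01.15
```

The trade-off described above is visible here: layout 2 only pays off if your queries are always scoped to one application, since otherwise you end up querying all the per-application indices anyway, which behaves like one index with more shards.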
You might want to check out the following resources that talk about
capacity planning:
Thanks so much for the response. It was very helpful. I will check out
those links on capacity planning for sure.
One follow-up question: you mention that tens of shards per node would be
OK. Do you mean tens of shards across tens of indexes, or tens of
shards for a single index? Right now I have two servers, with the
index getting 2 shards (one per server) and 1 replica (per server).
I meant tens of shards per node. So if you have N nodes with I indices
that each have S shards and R replicas, that would be (I * S * (1 + R)) / N
shards per node.
One shard per node is optimal but doesn't allow for growth: if you add one
more node, you cannot spread the indexing workload. That is why it is
common to have a few shards per node, so that Elasticsearch can
spread the load if you introduce a new node into your cluster to
increase its capacity.
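Plugging the setup from the follow-up question into that formula (1 daily index, 2 primary shards, 1 replica, 2 nodes):

```python
def shards_per_node(indices, shards, replicas, nodes):
    """Total shard copies (primaries + replicas) spread across nodes:
    (I * S * (1 + R)) / N."""
    return indices * shards * (1 + replicas) / nodes

# The follow-up poster's setup: one daily index with 2 primary
# shards and 1 replica on a 2-node cluster.
print(shards_per_node(indices=1, shards=2, replicas=1, nodes=2))  # 2.0
```

That is 2 shard copies per node per daily index; multiply by the number of daily indices you retain to get the real per-node total, which is the figure to keep in the "few tens" range.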