One large index vs. many smaller indexes


(Chris Neal) #1

Hi all,

As the subject says, I'm wondering about index size vs. number of indexes.

I'm indexing many application log files, currently with an index by day for
all logs, which will make a very large index. For just a few applications
in Development, the index is 55GB a day (across 2 servers). In prod with
all applications, it will be "much more than that". 1TB a day maybe?

I'm wondering if there is value in splitting the indexes by day and by
application, which would produce more indexes per day, but they would be
smaller, vs. value in having a single, mammoth index by day alone.

Is it just a resource question? If I have enough RAM/disk/CPU to support a
"mammoth" index, then I'm fine? Or are there other reasons to (or to not)
split up indexes?

Very much appreciate your time.
Chris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAND3DphfsYx0LW0M-yvLWGauRSzVWG0etaBkiTrN7zVafq7tMA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

Hi Chris,

Usually the concern is less the number of indices than the number of shards,
which are the physical units of data storage (an index is a logical view
over several shards).

Something to be aware of is that each shard carries some constant overhead
(disk space, file descriptors, memory usage) that does not depend on the
amount of data it stores. Having up to a few tens of shards per node is
fine, but you should avoid having, e.g., thousands of shards per node.

If you plan on always adding a filter for a specific application to your
search requests, then splitting by application makes sense, since it makes
the filter unnecessary at search time: you just query the
application-specific index. On the other hand, if you don't filter by
application, then splitting the data yourself into smaller indices would be
roughly equivalent to storing everything in a single index with a higher
number of shards.
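As a rough illustration of that difference, here is a minimal Python sketch that only builds the request path and body for each layout, without contacting a cluster. The index naming scheme (`logs-<date>` and `logs-<app>-<date>`), the `application` and `message` field names, and the helper functions are all assumptions made for the example, not details from Chris's setup. The `bool`/`filter` form shown is the one used by later Elasticsearch releases; the 1.x releases current at the time of this thread expressed the same thing with a `filtered` query.

```python
from datetime import date


def daily_index(day, app=None):
    """Hypothetical index name: one per day, or one per day and application."""
    suffix = day.strftime("%Y.%m.%d")
    return f"logs-{app}-{suffix}" if app else f"logs-{suffix}"


def search_request(day, app, text, split_by_app):
    """Return (path, body) for searching one application's logs on one day."""
    match = {"match": {"message": text}}
    if split_by_app:
        # The index name already selects the application, so no filter is needed.
        return f"/{daily_index(day, app)}/_search", {"query": match}
    # Single daily index: every request has to carry the application filter.
    body = {
        "query": {
            "bool": {
                "must": match,
                "filter": {"term": {"application": app}},
            }
        }
    }
    return f"/{daily_index(day)}/_search", body


day = date(2014, 8, 22)
print(search_request(day, "billing", "timeout", split_by_app=True))
print(search_request(day, "billing", "timeout", split_by_app=False))
```

With per-application indices the request touches only that application's shards; with one daily index the same request visits every shard of the day and relies on the filter to discard other applications' documents.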

You might want to check out the following resources that talk about
capacity planning:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/capacity-planning.html

(In reply to Chris Neal's message of Fri, Aug 22, 2014 at 9:08 PM.)

--
Adrien Grand



(Chris Neal) #3

Adrien,

Thanks so much for the response. It was very helpful. I will check out
those links on capacity planning for sure.

One follow-up question. You mention that tens of shards per node would be
OK. Do you mean tens of shards from tens of indexes, or tens of shards for
a single index? Right now I have two servers, with the index getting
2 shards (one per server) and 1 replica (per server).

Chris

(In reply to Adrien Grand's message of Fri, Aug 22, 2014 at 5:58 PM.)


(Adrien Grand) #4

I meant tens of shards per node in total, across all indices. So if you
have N nodes and I indices, each with S shards and R replicas, that would be
(I * S * (1 + R)) / N shards per node.
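To make the formula concrete, here is a small Python sketch. The first call uses the setup Chris described (2 nodes, one index, 2 shards, 1 replica); the 30-day retention and the 20 applications in the other calls are made-up numbers, chosen only to show how quickly the count grows.

```python
def shards_per_node(nodes, indices, shards, replicas):
    """Average number of shard copies per node: (I * S * (1 + R)) / N."""
    return indices * shards * (1 + replicas) / nodes


# Setup described in the thread: 2 nodes, 1 daily index, 2 shards, 1 replica.
print(shards_per_node(nodes=2, indices=1, shards=2, replicas=1))        # 2.0

# Assumption: 30 daily indices are kept open on the same 2 nodes.
print(shards_per_node(nodes=2, indices=30, shards=2, replicas=1))       # 60.0

# Assumption: additionally split each day into 20 per-application indices.
print(shards_per_node(nodes=2, indices=30 * 20, shards=2, replicas=1))  # 1200.0
```

Under those assumed numbers, daily indices alone stay in the "tens of shards per node" range, while daily-by-application indices with the same shard settings land in the thousands.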

One shard per node is optimal but doesn't allow for growth: if you add one
more node, you cannot spread the indexing workload. That is why it is
common to have a few shards per node, so that Elasticsearch can spread the
load if you add a new node to your cluster to increase its capacity.
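The growth argument can be sketched with a deliberately simplified model in Python: primary shards are dealt out round-robin and replicas are ignored. This is not how Elasticsearch's allocator actually decides placement; the function name and the shard counts are assumptions for illustration only.

```python
def primaries_per_node(num_shards, num_nodes):
    """Spread primary shards evenly over nodes (simplified round-robin model)."""
    counts = [0] * num_nodes
    for shard in range(num_shards):
        counts[shard % num_nodes] += 1
    return counts


# One shard per node: adding a third node leaves it with nothing to index.
print(primaries_per_node(2, 2))  # [1, 1]
print(primaries_per_node(2, 3))  # [1, 1, 0]

# A few shards per node: the same new node takes over a share of the work.
print(primaries_per_node(6, 2))  # [3, 3]
print(primaries_per_node(6, 3))  # [2, 2, 2]
```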

(In reply to Chris Neal's message of Mon, Aug 25, 2014 at 12:07 AM.)

--
Adrien Grand



(Chris Neal) #5

Thanks Adrien!

Very much appreciate your time and help.

Chris

(In reply to Adrien Grand's message of Mon, Aug 25, 2014 at 3:44 AM.)

