How many indices can ES support?

Hi,

Let me explain the problem which I am trying to address:

We have a system that stores logs per user, and each user produces around a
few terabytes of logs per day. I have a growing number of users. As per the
guidelines I can split the indices per day or per user; however, I am
thinking of creating indices per account and per day, so the index name
looks like

:

That means the number of indices will be (users * no_of_days). So the
question is: how many indices can ES support?

The reason I am creating an index per user and per day is our data purging
policy. Each user's logs are retained for a certain number of days, so I
assumed it is easier to delete the whole index for that user and day. The
other approach is to create indices per day only, store the username as one
of the fields in the document, and issue a query that bulk-deletes the logs
whose timestamp is older than the retention period. I assume that the latter
approach is much more expensive than just deleting the index. Is my
assumption correct?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hello,

On Thu, May 9, 2013 at 2:25 AM, vinod eligeti veligeti999@gmail.com wrote:

> That means the number of indices will be (users * no_of_days). So the
> question is: how many indices can ES support?

I'm not aware of such a limit. However, more indices imply more shards, and
each shard comes with a memory overhead. And, assuming you keep the same
merge policies, having more shards means more segments - so your data will
be less compact (more disk space used), and you'll need more open files.

Since you say users send terabytes of logs per day, I assume you have a
pretty large cluster to handle that, so maybe the problems mentioned above
are negligible.
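
To put rough numbers on the "more indices imply more shards" point, here is a minimal sketch. The user count and retention period are made up for illustration, and the 5 primary shards and 1 replica per index are an assumption (as far as I know, those were the defaults in Elasticsearch releases of that era); plug in your own settings.

```python
def estimate_shards(users, retention_days, primaries_per_index=5, replicas=1):
    """Return (index_count, shard_count) for one index per user per day.

    primaries_per_index and replicas are assumed settings, not values
    taken from this thread.
    """
    indices = users * retention_days
    # Every primary shard has `replicas` copies, and each copy is a shard too.
    shards = indices * primaries_per_index * (1 + replicas)
    return indices, shards


# Hypothetical example: 200 users, 30 days of retention.
indices, shards = estimate_shards(users=200, retention_days=30)
print(indices, shards)
```

Each of those shards carries its own memory and open-file overhead, so lowering the number of primaries per index is one obvious knob if the per-user, per-day layout is kept.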

> I assume that the latter approach is much more expensive than just
> deleting the index. Is my assumption correct?

Yes, your assumption is correct. Deleting individual docs means searching
them all and marking them as deleted, so they can be purged in the next
merge (which will make merging more expensive as well). Deleting indices is
pretty much like deleting files on the disk.
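
The two purge strategies can be sketched side by side. This only builds the requests, it does not talk to a cluster. The `logs-<user>-<date>` index name, the `user` field, and the `@timestamp` field are hypothetical names chosen for illustration; the query body is written in the query DSL of current Elasticsearch releases (where it would go to the `_delete_by_query` endpoint) and may need adapting for the version discussed in this thread.

```python
from datetime import date, timedelta


def index_name(user, day):
    # Hypothetical naming scheme; index names must be lowercase.
    return f"logs-{user.lower()}-{day:%Y.%m.%d}"


def expired_index_deletes(user, retention_days, today, oldest):
    """List one DELETE request per daily index older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    requests = []
    day = oldest
    while day < cutoff:
        requests.append(("DELETE", "/" + index_name(user, day)))
        day += timedelta(days=1)
    return requests


def delete_by_query_body(user, retention_days):
    """Query body selecting one user's docs older than the retention period.

    Field names `user` and `@timestamp` are assumptions.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user": user}},
                    {"range": {"@timestamp": {"lt": f"now-{retention_days}d"}}},
                ]
            }
        }
    }


for method, path in expired_index_deletes(
    "User1", 30, today=date(2013, 5, 10), oldest=date(2013, 4, 8)
):
    print(method, path)
```

The first function produces a handful of cheap index deletions; the second produces a query that has to find and mark every matching document, which is the cost described above.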

If you want to have only one index per day, it might be worth putting each
user's logs in a different type. The type name is essentially just another
field (so no performance gain), but it's easier to index, search, and
delete when you're using the type name in the URL.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Radu,

As per my understanding of what you said: if, for example, I have User1 and
User2 as two doc types, it is much faster to delete a single doc type, say
User1, than to have one doc type, say Log, with a field user that can hold
the values User1 and User2. Is that correct?

In the ES Guide, the doc type is described as just another field of a
document, defined by the implicit _type field. If that's the case, why would
deleting a doc type be faster than deleting by the user field? Deleting a
doc type should internally map to the _type field, which is the same as a
bulk delete query of user=User1.

I hope I am not confusing you with my example.

thanks


Hey vinod,

you are right that the type is just another document field. Deleting by
type still means you delete data from an index instead of deleting the
whole index, as in your date+user index name example, which might be the
better way to go. Having indices per user and date comes at the cost of
having to open more file descriptors and use memory per index, but might
save you a lot of time with your housekeeping routines, especially if you
only need to hold the data for a limited amount of time. If you have enough
disk space, it might also be an option to just close the index instead of
deleting it, so it does not eat up any resources other than disk space.
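
A housekeeping routine that closes indices first and deletes them later could be sketched roughly like this. The two age thresholds and the index name are hypothetical; the function only returns the HTTP method and path it would issue (the index close endpoint and the index delete endpoint), it does not contact a cluster.

```python
def housekeeping_request(index, age_days, close_after_days, delete_after_days):
    """Decide what to do with one daily index based on its age in days.

    Returns a (method, path) tuple, or None if the index should stay open.
    The thresholds are assumptions, not values from this thread.
    """
    if age_days >= delete_after_days:
        # Past retention: drop the whole index.
        return ("DELETE", f"/{index}")
    if age_days >= close_after_days:
        # Keep the data on disk, but release memory and file handles.
        return ("POST", f"/{index}/_close")
    return None


# Hypothetical index name and thresholds.
print(housekeeping_request("logs-user1-2013.04.08", age_days=32,
                           close_after_days=7, delete_after_days=30))
print(housekeeping_request("logs-user1-2013.05.01", age_days=9,
                           close_after_days=7, delete_after_days=30))
```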

--Alex


Right. Sorry if I just added confusion to the matter. AFAIK there's no
difference in performance between having a type per user and having the
user in an extra field. I just think it's easier to use that way, not
faster.

Best regards,
Radu


Alex,

I am fine if my purging routines are a bit slow, but if it takes on the
order of a few hours to delete the data then it may not be feasible.
However, memory is an important consideration. Storing messages from all my
users in one index (per day) may not save that much memory if the docs are
completely different, no?
