Index rolling window of data


(Ashish Nigam) #1

Hi,
I need help/suggestions in defining the right indexing/data migration strategy.
I have to create tenant-based index storage. My use case covers around 100 tenants.
I need to allow users to search data for the last 30 days, so that would mean I would need to create rolling indexes covering 30 days. My estimate is that there will be around 4 million rows per tenant in 30 days.

I am thinking of creating an index per tenant per day and maintaining an alias over the last 30 days. But that would mean 100 * 30 = 3000 active indexes at any point in time. Is this a huge number of indexes compared to the amount of data that needs to be managed?
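The index-per-tenant-per-day approach could be driven by a small nightly job that rolls the alias forward. A minimal sketch, assuming a hypothetical `<tenant>-YYYY.MM.DD` index naming convention and a `<tenant>-last30` alias (both names are placeholders, not anything Elasticsearch mandates):

```python
from datetime import date, timedelta

def rolling_alias_actions(tenant, today, days=30):
    """Build the alias-update actions for one tenant's rolling window.

    Index names are assumed to follow a hypothetical
    '<tenant>-YYYY.MM.DD' convention; adjust to your own scheme.
    """
    newest = f"{tenant}-{today:%Y.%m.%d}"
    oldest = f"{tenant}-{today - timedelta(days=days):%Y.%m.%d}"
    # Sent as a single request to the _aliases endpoint, this atomically
    # swaps the expired index out and today's index in.
    return {
        "actions": [
            {"add": {"index": newest, "alias": f"{tenant}-last30"}},
            {"remove": {"index": oldest, "alias": f"{tenant}-last30"}},
        ]
    }

actions = rolling_alias_actions("acme", date(2012, 6, 21))
```

The index that falls out of the alias can then simply be deleted, which is much cheaper than deleting individual documents.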

Another option would be to maintain just one index per tenant and move a day's worth of data out every day, so that the active alias always holds 30 days' worth of data.

Please suggest the right strategy to move forward here. If the second option is better, I don't know the right way to selectively move a day's worth of data from an index to some other place.

Thanks
Ashish


(Otis Gospodnetić) #2

Hi,

With so few docs/tenant/30 days I'd use just 1 index/tenant, add all
content to it, and remove docs > 30 days old from it on a nightly basis.
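The nightly removal could be a delete-by-query with a range filter on the documents' date field. A sketch of the request body only; the field name `timestamp` is an assumption and must match whatever date field your documents actually carry:

```python
def purge_query(cutoff_days=30):
    """Request body for a delete-by-query that removes documents older
    than the cutoff, using date math relative to 'now'."""
    return {
        "range": {
            "timestamp": {
                "lt": f"now-{cutoff_days}d"
            }
        }
    }

body = purge_query()
# Would be sent to the tenant's index via the delete-by-query endpoint,
# e.g. DELETE /<tenant-index>/_query
```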

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html



(Ashish Nigam) #3

Thanks Otis.
I would stick to one index/tenant.
To remove docs on a nightly basis from the index, is there any good way to perform a soft delete, i.e. move a day's worth of data from an index to another index?
I can probably search all the data, move it to another index, and then delete it from the original index. But if there is a way that doesn't require a client to explicitly execute these three steps, that might be more efficient.



(Otis Gospodnetić) #4

I don't think there is something built-in that does this. But it shouldn't
be too hard to write an app that queries your main index, gets all
to-be-deleted docs and uses their _source to index them to another index.
Might be a nice addition to ES or a standalone tool, as a matter of fact.
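The reindex-then-delete app described above would typically scroll through the matching documents and feed their `_source` to the bulk API of the target index. A sketch of the payload-building step, assuming `hits` is the `hits.hits` list from a scroll response (the surrounding loop that scrolls, POSTs to `_bulk`, and finally deletes from the source index is omitted):

```python
import json

def bulk_reindex_payload(hits, target_index):
    """Turn a page of search hits into a _bulk request body that
    re-indexes each document's _source into `target_index`.

    `hits` is the 'hits.hits' list from a scroll response.
    """
    lines = []
    for hit in hits:
        # Action line: tell the bulk API where this document goes,
        # preserving its original type and id.
        lines.append(json.dumps({"index": {
            "_index": target_index,
            "_type": hit["_type"],
            "_id": hit["_id"],
        }}))
        # Source line: the document body itself.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"
```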

Otis



(Alexander Reelsen) #5

Hi Ashish,

On Fri, Jun 22, 2012 at 11:20 PM, Ashish Nigam <ashish@skyhighnetworks.com> wrote:

To remove docs on nightly basis from the index, is there any good way to
perform soft delete, i.e move a day worth of data from an index to another
index?

You might want to check the TTL feature for this, so your docs get
automatically deleted after 30 days.

http://www.elasticsearch.org/guide/reference/mapping/ttl-field.html
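For reference, enabling TTL is a mapping-level setting along these lines (the type name `event` is a placeholder; the default can also be overridden per document at index time):

```json
{
  "event": {
    "_ttl": {
      "enabled": true,
      "default": "30d"
    }
  }
}
```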

--Alexander


(Shay Banon) #6

Note, it is considerably cheaper to delete an index than to delete data from an index.

Having 3000 shards is possible, but it will "cost" you in terms of the number of nodes you will need to start in order to support it. What I would do is an index per day, and on that index have several shards holding data for all tenants, using routing (the tenant "name") to direct each tenant's data to a specific shard.
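Concretely, the same routing value is passed both when indexing and when searching, so a tenant's query only touches the one shard that holds its data instead of fanning out to every shard. A sketch of the URLs involved, where the daily index name `logs-YYYY.MM.DD`, the type `event`, and the host are all placeholder assumptions:

```python
def routed_urls(day, tenant, base="http://localhost:9200"):
    """Index and search URLs that use the tenant name as the routing
    value against a hypothetical daily index 'logs-YYYY.MM.DD'."""
    index = f"logs-{day}"
    return {
        # POST a document here; routing keeps all of this tenant's
        # docs on one shard of the shared daily index.
        "index": f"{base}/{index}/event/?routing={tenant}",
        # Search with the same routing value so only that shard is hit.
        "search": f"{base}/{index}/event/_search?routing={tenant}",
    }

urls = routed_urls("2012.06.25", "acme")
```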

Here is the presentation I gave that explains it:
https://speakerdeck.com/u/kimchy/p/elasticsearch-big-data-search-analytics.



(Ashish Nigam-2) #7

Thanks for your feedback, Shay.
There can be a few high-traffic tenants for which I would need to store as many as 20 million entries in a day.
Since routing sends all of a tenant's data to a single shard, will it be fine to store that much data on one shard? I am still in investigation mode and do not have a cluster to verify performance with respect to data volume and shard allocation.
If it is standard to store this much data on a single shard, I will move ahead with this assumption.


