TTL for documents

A lot of documents naturally come with an expiration date. I think it
would be nice to have built-in support for a per-document TTL (eventually
with default TTLs configurable per type/index). I know disks are not
expensive these days, but putting a TTL on documents is still common
practice, and it would be a very useful feature, especially for people
using ES as a key-value store. It is a pain to make users regularly
trigger delete-by-query jobs to purge the data, and I think this is a
common enough use case to include in the core of ES.

Concerning the implementation of this feature, I propose introducing a
special _ttl field. When searching, ES could hide expired documents
(maybe by adding a "not" query for expired docs to each query, or
something smarter). The documents could then actually be deleted, and
the disk space freed, during segment merges.
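
A minimal sketch of the "not query for expired docs" idea, assuming the
bool/range query DSL of later Elasticsearch releases. The field name
expires_at and the helper hide_expired are hypothetical, chosen only for
illustration; this is not how ES implements anything.

```python
import time


def hide_expired(query, now_ms=None, ttl_field="expires_at"):
    """Wrap `query` so that documents whose expiry time has passed are excluded.

    `ttl_field` is a hypothetical field holding the expiry time in epoch
    milliseconds. Documents without that field never match the range
    clause, so they are never hidden.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "bool": {
            "must": [query],
            # The "not" part: drop anything that expired at or before now.
            "must_not": [{"range": {ttl_field: {"lte": now_ms}}}],
        }
    }


wrapped = hide_expired({"match": {"title": "ttl"}}, now_ms=1311724080000)
```

This only hides expired documents at search time; reclaiming the disk
space would still need a real delete or a merge-time hook.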

What do you think?

Heya,

Yes, it's possible to add this feature. I think there is already an issue
open for something similar... Would love to hear what other people
think...

-shay.banon

I've already been wondering why I couldn't send any more messages to the list :wink:
Well, anyway, here is what I wanted to post:

+1
I would love to see this feature implemented as I am doing something
similar on the client side and it would be much simpler if the server
could take care of the expiration date.

Best,
Michel

Yea, annoyed by it :). Btw, wanted to post another important note on
managing expiring data. Another way of doing it, assuming it applies to the
use case, is to create an index per timespan, and then expire data by simply
deleting old indices. The benefit of this usage pattern is that deleting an
index is much faster and puts less strain on the system than deleting
specific documents from an index (which will have to be merged out).

For example, you could index log data into a single index, and give it a
TTL of 2 weeks. A better solution would be to create an index per week,
and delete old indices once they pass the 2-week mark.
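
A rough sketch of the index-per-week pattern, using only the Python
standard library. The naming scheme (prefix-YEAR.WEEK, based on ISO
weeks) and the helper names are assumptions made for this example; the
actual index deletion is left to whatever client you use.

```python
from datetime import date, timedelta


def weekly_index(prefix, day):
    """Name of the index holding data for the ISO week that contains `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}.{iso_week:02d}"


def week_start(index_name):
    """Monday of the ISO week encoded in a name built by weekly_index()."""
    year_week = index_name.rsplit("-", 1)[1]
    iso_year, iso_week = (int(part) for part in year_week.split("."))
    return date.fromisocalendar(iso_year, iso_week, 1)


def expired_indices(index_names, today, keep_weeks=2):
    """Indices whose whole week ended at least `keep_weeks` weeks before `today`."""
    cutoff = today - timedelta(weeks=keep_weeks)
    return [
        name for name in index_names
        if week_start(name) + timedelta(days=7) <= cutoff
    ]


today = date(2011, 7, 27)
names = [f"logs-2011.{week}" for week in range(26, 31)]
current = weekly_index("logs", today)      # index to write into today
to_delete = expired_indices(names, today)  # candidates for deletion
```

An index is only dropped once its newest possible document is older than
the retention period, so some documents live up to a week longer than
the nominal 2 weeks.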

-shay.banon

That's exactly how I'm doing it at the moment. Although I think it
would in some cases be convenient to be able to specify a TTL while
indexing, for instance if your expiring data is spread irregularly and
sparsely over the time range. In this case it would be difficult to
choose the time range of the different indices, and if you want to,
say, keep docs up to a maximum age of 6 months, I think it would be nice
to specify the TTL for the docs while indexing, instead of periodically
iterating over the results and cleaning up manually.

Yeah, I agree on the index-per-timespan approach: it is an efficient way to
handle expiring data for logs and things like that, but it doesn't really
fit my use cases:

  • the user still has to manage the deletion of expired indices with
    external jobs, which is of course easy but not really nice
  • it is not really flexible because you have to choose your time range a
    priori. In my use case I would like to be able to dynamically change the
    TTL of an indexed doc, and I don't want to have to delete it from one
    index and reindex it into another one fitting the new TTL
  • it can lead to a lot of indices (think, for example, of one index per
    user, with each index further divided by time range...), which adds
    overhead

If other people are interested and we can agree here on a good way to
implement it, I am quite willing to implement it. Kimchy, do you have any
special recommendations or concerns about the implementation?

I think we identified two different implementations:

  1. The first is one that I have been thinking about for a long time:
    automatic rolling of indices. Basically, utilizing the index templates
    notion, one can define an index rolling strategy (time based can be the
    first one). When indexing, we can check if a rollover is needed, and if
    so, we can create a new index and index the data into it. The fact that
    it's built on top of index templates means it will automatically support
    custom settings and mappings for the indices created.

    This one can touch on several places in elasticsearch, and can have the
    following additional features:

    • Automatic index naming based on rollover strategy (week / day / ...)
    • Automatically delete old indices, where old is defined in the rollover
      strategy.
    • Automatic setting of aliases. For example, an "indexing" alias and
      "search" alias, as well as possible additional search aliases ("last_week",
      "last_month").
  2. TTL per document in the index. That one is a bit more tricky as it
    requires thinking about where the TTL will be stored. It can be stored
    in the document, but then it requires reindexing whenever it changes. It
    will also require a process that periodically evicts old documents.
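
To make the trade-off in the second option concrete, here is a toy,
in-memory sketch (not Elasticsearch code) in which the expiry time is
kept outside the document, so that changing a TTL does not require
reindexing, and a separate pass evicts old documents. The class TtlStore
and all of its method names are hypothetical.

```python
import time


class TtlStore:
    """Toy document store that keeps expiry times outside the documents."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._docs = {}     # doc_id -> document body
        self._expires = {}  # doc_id -> absolute expiry time in seconds

    def index(self, doc_id, doc, ttl=None):
        """Store a document; `ttl` is in seconds, None means it never expires."""
        self._docs[doc_id] = doc
        self._expires.pop(doc_id, None)
        if ttl is not None:
            self.set_ttl(doc_id, ttl)

    def set_ttl(self, doc_id, ttl):
        """Change the TTL without touching (reindexing) the document."""
        if doc_id not in self._docs:
            raise KeyError(doc_id)
        self._expires[doc_id] = self._clock() + ttl

    def get(self, doc_id):
        """Return the document, hiding it once expired even if not yet evicted."""
        expiry = self._expires.get(doc_id)
        if expiry is not None and expiry <= self._clock():
            return None
        return self._docs.get(doc_id)

    def evict_expired(self):
        """One pass of the periodic eviction process; returns the evicted ids."""
        now = self._clock()
        expired = [d for d, expiry in self._expires.items() if expiry <= now]
        for doc_id in expired:
            del self._docs[doc_id]
            del self._expires[doc_id]
        return expired
```

In a real system evict_expired would be driven by a scheduler, and the
side table of expiry times would itself have to be persisted and
replicated, which is part of what makes this option tricky.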

On Wed, Jul 27, 2011 at 3:27 PM, Benjamin Devèze
benjamin.deveze@gmail.com wrote:

If there are other people interested and if we can agree here to a good way
to implement it I am quite willing to implement it. Kimchy do you have
special recommendations, concerns about the implementation?

(Sending to new mailing list)

+1 to this feature. It will help in my scenario.

I have docs in CouchDB which have publication and expiry timestamps in them.
I expose Elasticsearch as a query layer to users for these docs.
I have a celery (python) job which keeps syncing (POST/DELETE) these docs to
Elasticsearch at appropriate times.

Typically there is a 2-5 minute delay in these operations (publish_time +
x_minutes). It is OK for me to publish the doc to Elasticsearch with a
delay, but withdrawing with a delay is a bit painful.

If a _ttl field is supported, it will make withdrawing docs easier and
almost realtime.

Regards,
Mahendra

--
Mahendra

http://twitter.com/mahendra