Many documents naturally come with an expiration date. I think it
would be nice to have built-in support for a per-document TTL
(possibly with default TTLs configurable per type/index). I know disks
are not expensive these days, but TTLs on documents are still a common
requirement, and the feature would be especially useful for people
using ES as a key-value store. It is a pain to make users regularly
trigger delete-by-query jobs to purge the data, and I think the use
case is common enough to belong in the core of ES.
For the implementation, I propose introducing a special _ttl field.
When searching, ES could hide expired documents (maybe by adding a
"not expired" clause to each query, or something smarter). The
documents could then be physically deleted, and the disk space freed,
during segment merges.
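To make the proposal a bit more concrete, here is a minimal in-memory
sketch in Python of the two steps described above: hiding expired
documents at search time, and physically dropping them in a later
merge-like pass. It is not how ES works internally; the field name
_indexed_at, the TTL unit (seconds) and the functions is_expired,
search and merge are all assumptions made up for this illustration.

```python
import time


def is_expired(doc, now):
    """Return True when the document's TTL has elapsed.

    Assumes each doc is a dict with a hypothetical `_indexed_at`
    timestamp (seconds) and an optional `_ttl` in seconds.
    Documents without `_ttl` never expire.
    """
    ttl = doc.get("_ttl")
    if ttl is None:
        return False
    return doc["_indexed_at"] + ttl <= now


def search(docs, predicate, now=None):
    """Return matching docs, hiding expired ones (the "not expired" clause)."""
    now = time.time() if now is None else now
    return [d for d in docs if predicate(d) and not is_expired(d, now)]


def merge(docs, now=None):
    """Simulate a merge: expired docs are dropped for good, freeing space."""
    now = time.time() if now is None else now
    return [d for d in docs if not is_expired(d, now)]
```

With this shape, an expired document stops showing up in search results
immediately, even though it only disappears from storage when merge runs.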
I had been wondering why I could no longer send messages to the list.
Anyway, here is what I wanted to post:
+1
I would love to see this feature implemented as I am doing something
similar on the client side and it would be much simpler if the server
could take care of the expiration date.
Heya,
Yes, it's possible to add this feature. I think there is already an issue
open for something similar... Would love to hear what other people
think...
-shay.banon
Yeah, annoyed by it :). By the way, I wanted to post another important note
on managing expiring data. Another way of doing it, assuming it fits the
use case, is to create an index per timespan and then expire data by simply
deleting old indices. The benefit of this pattern is that deleting an index
is much faster and puts less strain on the system than deleting specific
documents from an index (which then have to be merged out).
For example, you could index log data into a single index with a TTL of
2 weeks. A better solution would be to create an index per week and delete
old indices once they pass the 2-week mark.
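As a rough sketch of the index-per-week pattern described above, the
Python below computes a weekly index name for a given day and picks out
the indices that are past a 2-week retention. The naming scheme
("logs-YYYY-wWW", based on ISO weeks) and the functions index_name_for
and expired_indices are assumptions for illustration only; actually
deleting the returned indices is left to whatever client you use.

```python
from datetime import date, timedelta


def index_name_for(day, prefix="logs"):
    """Build a weekly index name such as 'logs-2011-w30' from a date."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}-w{iso_week:02d}"


def expired_indices(index_names, today, retention=timedelta(weeks=2),
                    prefix="logs"):
    """Return the indices whose whole week ended more than `retention` ago.

    An index is only considered expired once its newest possible
    document (the Sunday closing that ISO week) is older than the
    retention period.
    """
    expired = []
    for name in index_names:
        year_part, week_part = name[len(prefix) + 1:].split("-w")
        last_day = date.fromisocalendar(int(year_part), int(week_part), 7)
        if today - last_day > retention:
            expired.append(name)
    return expired
```

Note that with whole-week indices the oldest documents live for up to
three weeks before their index qualifies for deletion, which is the
price of the coarser granularity.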
That's exactly how I'm doing it at the moment. Still, I think it would in
some cases be convenient to be able to specify a TTL while indexing, for
instance if your expiring data is spread irregularly and sparsely over the
time range. In that case it would be difficult to choose the time ranges
for the different indices, and if you want to, say, keep docs up to a
maximum age of 6 months, I think it would be nicer to specify the TTL for
the docs while indexing instead of periodically iterating over the results
and cleaning up manually.
Yeah, I agree on the index-per-timespan approach. It is an efficient way to
handle expiring data for logs and things like that, but it doesn't really
fit my use cases:
- The user still has to manage the deletion of expired indices with
  external jobs, which is of course easy but not really nice.
- It is not really flexible, because you have to choose your time range a
  priori. In my use case I would like to be able to dynamically change the
  TTL of an indexed doc, and I don't want to have to delete it from one
  index and reindex it into another one that fits the new TTL.
- It can lead to a lot of indices (think, for example, of one index per
  user, with each index further divided by time range...), which adds
  overhead.
If other people are interested and we can agree here on a good way to
implement it, I am quite willing to do the implementation. Kimchy, do you
have any special recommendations or concerns about the implementation?
I think we identified two different implementations:
1. The first is one that I have been thinking about for a long time:
   automatic rolling of indices. Basically, building on the index templates
   notion, one can define an index rollover strategy (time-based can be the
   first one). When indexing, we check whether a rollover is needed, and if
   so, we create a new index and index the data into it. Because it is
   built on top of index templates, it automatically supports custom
   settings and mappings for the indices created.
   This one touches several places in elasticsearch and can have these
   additional features:
   - Automatic index naming based on the rollover strategy (week / day /
     ...).
   - Automatic deletion of old indices, where "old" is defined in the
     rollover strategy.
   - Automatic setting of aliases. For example, an "indexing" alias and a
     "search" alias, as well as possible additional search aliases
     ("last_week", "last_month").
2. TTL per document in the index. That one is a bit more tricky, as it
   requires thinking about where the TTL will be stored. It can be stored
   in the document, but then it requires reindexing whenever it changes. It
   will also require a process that periodically evicts old documents.
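To illustrate the trade-off in the per-document TTL option, here is a
small Python sketch of one possible shape: expiry times are kept in a
side table next to the documents, so changing a TTL does not touch the
document body, and a purge method plays the role of the periodic
eviction process. This is only a toy model under those assumptions, not
a description of how elasticsearch stores anything; the class TtlStore
and its methods are hypothetical names.

```python
import heapq


class TtlStore:
    """Toy document store with per-document expiry kept outside the doc."""

    def __init__(self):
        self._docs = {}    # doc_id -> document body
        self._expiry = {}  # doc_id -> current absolute expiry time
        self._heap = []    # (expiry, doc_id); may hold stale entries

    def index(self, doc_id, body, ttl, now):
        """Store a document that expires `ttl` seconds after `now`."""
        self._docs[doc_id] = body
        self.set_ttl(doc_id, ttl, now)

    def set_ttl(self, doc_id, ttl, now):
        """Change a document's TTL without rewriting the document body."""
        if doc_id not in self._docs:
            raise KeyError(doc_id)
        expires_at = now + ttl
        self._expiry[doc_id] = expires_at
        # The old heap entry (if any) is left in place and skipped later.
        heapq.heappush(self._heap, (expires_at, doc_id))

    def purge(self, now):
        """Evict every document whose expiry has passed; return their ids."""
        evicted = []
        while self._heap and self._heap[0][0] <= now:
            expires_at, doc_id = heapq.heappop(self._heap)
            # Ignore entries made stale by a later set_ttl or an eviction.
            if self._expiry.get(doc_id) == expires_at:
                del self._expiry[doc_id]
                del self._docs[doc_id]
                evicted.append(doc_id)
        return evicted

    def __contains__(self, doc_id):
        return doc_id in self._docs
```

A scheduler would call purge(now) at some interval; how often it runs
bounds how long an expired document can linger.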
I have docs in CouchDB which have publication and expiry timestamps in them.
I expose Elasticsearch as a query layer to users for these docs.
I have a celery (python) job which keeps syncing (POST/DELETE) these docs to
Elasticsearch at appropriate times.
Typically there is a 2-5 minute delay in these operations (publish_time +
x_minutes). It is OK for me to publish the doc to Elasticsearch with a
delay, but withdrawing with a delay is a bit painful.
If a _ttl field is supported, it will make withdrawing docs easier and
almost realtime.