Mass delete by query


(slushi) #1

I have varying data retention requirements I am trying to balance (I am
continuously indexing new documents):

  • 1% of my documents need to be kept forever
  • 10% need to be kept 1 year
  • the remainder needs to be kept for 1 month

I can easily set properties indicating the retention policy for each
document and then periodically run a "delete by query". However, since the
deletes would eventually remove 89% of the indexed documents, would there be
any potential performance problems with this straightforward approach? I guess
this is a YMMV type of thing, but I was wondering what the typical
approach is here. Would it be necessary to restrict the query so that it does
not affect so many documents at once? Would query performance be greatly
impacted?
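A minimal sketch of what such a periodic delete could look like, built one retention class at a time so that a single run touches fewer documents. The field names `retention` and `indexed_at` and the class names are assumptions for illustration; the thread does not name them. The function only builds the query body and does not send it anywhere.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention classes; "forever" documents are never deleted.
RETENTION_DAYS = {"month": 30, "year": 365}


def expired_query(retention, now):
    """Build a query body matching expired documents of one retention class."""
    cutoff = now - timedelta(days=RETENTION_DAYS[retention])
    return {
        "query": {
            "bool": {
                "must": [
                    # `retention` and `indexed_at` are assumed field names.
                    {"term": {"retention": retention}},
                    {"range": {"indexed_at": {"lt": cutoff.isoformat()}}},
                ]
            }
        }
    }


now = datetime(2014, 4, 1, tzinfo=timezone.utc)
body = expired_query("month", now)
```

Keep in mind that deleted documents are only marked as deleted in their segments; the disk space is reclaimed later, when those segments are merged.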

The alternative approach I was considering is to create separate indices
for each retention type. Cleanup would be easier, but unfortunately a
document's retention policy can be upgraded or downgraded, so keeping the
indices consistent could get a little messy.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/672e4d70-b9f9-4f6c-b22e-4287ef5a27ab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Kevin Wang) #2

Why not use a TTL for each document?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
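As a rough sketch of the TTL suggestion, assuming an Elasticsearch 1.x cluster where the `_ttl` metadata field is available (to my knowledge it was deprecated in 2.0 and removed in later major versions): the mapping enables `_ttl` without a default, so documents indexed without a TTL are kept forever. The type name `doc` and the retention class names are assumptions.

```python
# Mapping fragment for a hypothetical type "doc". No "default" is set on
# purpose: documents indexed without a TTL then never expire.
TTL_MAPPING = {"doc": {"_ttl": {"enabled": True}}}

# Hypothetical retention classes mapped to a per-document TTL.
TTL_BY_RETENTION = {"month": "30d", "year": "365d", "forever": None}


def index_params(retention):
    """Return the request parameters to use when indexing a document."""
    ttl = TTL_BY_RETENTION[retention]
    # In 1.x the TTL can be passed as the `ttl` parameter of the index request.
    return {} if ttl is None else {"ttl": ttl}
```

Changing a document's retention policy would then amount to reindexing it with a different `ttl` value.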

(In reply to slushi's post of Tuesday, April 1, 2014 8:50:14 AM UTC+11.)


(slushi) #3

I attended an Elasticsearch meetup, and at some point it was mentioned
that TTL use is discouraged, but yes, this would make a lot of sense here.
Also, the 1 year figure is really a guesstimate; we want to keep as much of
that data as possible. I guess that with TTL you may not have as much
control over when document deletion and any resulting segment merging
happen? I am not that familiar with Elasticsearch performance yet (we just
started looking into using ES).

(In reply to Kevin Wang's post of Monday, March 31, 2014 5:52:28 PM UTC-4.)


(David Pilato) #4

If you know in advance (I mean at index time) when a document should be removed, you should send it to an index that will be deleted in its entirety after a given period.
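One way to picture this suggestion is a time-based index per month that is dropped as a whole once it is old enough. The sketch below only computes index names and decides which ones have expired; the `docs-monthly-` prefix and the helper names are hypothetical.

```python
from datetime import date

PREFIX = "docs-monthly-"  # hypothetical naming scheme


def index_for(day):
    """Name of the monthly index a document indexed on `day` goes to."""
    return f"{PREFIX}{day:%Y-%m}"


def expired_indices(existing, today, keep_months):
    """Return the monthly indices that are older than `keep_months` months."""
    # Months are compared as a single running count (year * 12 + month).
    oldest_kept = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in existing:
        year, month = map(int, name[len(PREFIX):].split("-"))
        if year * 12 + (month - 1) < oldest_kept:
            expired.append(name)
    return expired


existing = [index_for(date(2014, m, 15)) for m in (1, 2, 3, 4)]
to_drop = expired_indices(existing, date(2014, 4, 1), keep_months=1)
```

With `keep_months=1`, the previous month's index is kept until the current month ends, so each document lives between one and two months. Dropping a whole index avoids deleting documents one by one.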

Makes sense?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

(In reply to slushi's post of 1 Apr 2014, 00:00.)


(slushi) #5

Yes, but unfortunately the retention policy is not completely known at index
time. I would need to keep the separate indices in sync whenever a retention
policy change occurs. Attempting this seems like it could open up a whole can
of worms.
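To make the synchronization concern concrete, here is a small sketch of what a policy change would involve with one index per retention type: the document has to be written to the new index and deleted from the old one. The function name and index names are hypothetical, and the output is only a list of bulk-style action lines (older versions may also require a `_type` entry), not an actual request.

```python
def move_actions(doc_id, source, old_index, new_index):
    """Bulk-style lines that copy a document to new_index, then delete it from old_index."""
    if old_index == new_index:
        return []  # policy unchanged, nothing to move
    return [
        {"index": {"_index": new_index, "_id": doc_id}},
        source,
        {"delete": {"_index": old_index, "_id": doc_id}},
    ]


actions = move_actions("42", {"title": "example"}, "retain-month", "retain-year")
```

Bulk requests are not transactional, so if the delete fails after the index step succeeds, the document exists in both indices until something cleans it up.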

(In reply to David Pilato's post of Tuesday, April 1, 2014 1:58:04 AM UTC-4.)


(David Pilato) #6

In your use case, could the retention policy change for the 89% of documents that are kept for one month?
If not, I would create one index for the documents whose retention policy can change and use _ttl there. For the monthly docs, I would use one index per month.

If it can change, I think you will have to rely on _ttl, at the cost of heavier merges.
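The hybrid layout described here could be sketched as a routing decision made at index time. Everything named below (`route`, the index names, the `may_change` flag, the retention class names) is an assumption for illustration; the sketch also assumes a version of Elasticsearch where `_ttl` is available.

```python
from datetime import date

# Hypothetical TTL values per retention class; None means keep forever.
TTL_BY_RETENTION = {"month": "30d", "year": "365d", "forever": None}


def route(may_change, retention, day):
    """Return (index_name, ttl) for a new document."""
    if retention == "month" and not may_change:
        # Fixed one-month documents go to a monthly index that is dropped whole.
        return (f"docs-{day:%Y-%m}", None)
    # Everything else shares one index and relies on a per-document TTL.
    return ("docs-flexible", TTL_BY_RETENTION[retention])


fixed = route(False, "month", date(2014, 4, 1))
flexible = route(True, "month", date(2014, 4, 1))
```

Only the documents in the shared index incur per-document deletes and the extra merging they cause; the monthly indices are removed in one operation.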

My 2 cents.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

(In reply to slushi's post of 1 Apr 2014, 08:03.)

