Deleting old versions


(Hari Shankar) #1

Hi,

How do you delete older versions of documents? Suppose I always need only
the latest version, I wouldn't want to waste space storing the earlier
versions as well.

Hari


(Clinton Gormley) #2

Hi Hari

> How do you delete older versions of documents? Suppose I always need
> only the latest version, I wouldn't want to waste space storing the
> earlier versions as well.

One of three scenarios here:

  1. You are using the same ID for each version of the doc

In this case, each time you index a doc with the same ID, the old
version will be marked as deleted, and will no longer be visible.

These 'deleted' docs will be automatically removed at some point in the
future when the "segments" in your index are merged. This happens in
the background, but can be manually triggered using the optimize API.
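
To make the "marked as deleted, removed on merge" behaviour concrete, here is a toy in-memory model in Python. It is only a sketch of the semantics described above, not how ES or Lucene actually store segments; the class name ToyIndex and its methods are made up for illustration.

```python
class ToyIndex:
    """Toy model of overwrite-by-ID. Hypothetical; not the real ES internals."""

    def __init__(self):
        self.entries = []  # everything still taking up space, live or deleted
        self.live = {}     # doc ID -> position of the currently visible entry

    def index(self, doc_id, source):
        """Store a doc; an existing doc with the same ID is only marked deleted."""
        version = 1
        if doc_id in self.live:
            old = self.entries[self.live[doc_id]]
            old["deleted"] = True          # hidden, but still occupying space
            version = old["version"] + 1
        self.entries.append(
            {"id": doc_id, "version": version, "source": source, "deleted": False}
        )
        self.live[doc_id] = len(self.entries) - 1
        return version

    def get(self, doc_id):
        """Return the visible (latest) source for an ID, or None."""
        pos = self.live.get(doc_id)
        return None if pos is None else self.entries[pos]["source"]

    def merge(self):
        """Drop entries marked deleted, which is when the space is reclaimed."""
        self.entries = [e for e in self.entries if not e["deleted"]]
        self.live = {e["id"]: i for i, e in enumerate(self.entries)}
```

Indexing twice under the same ID leaves two stored entries but only one visible doc; after merge() only the latest entry remains.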

  2. You are using different IDs for each version of the doc, or no ID, in
    which case ES generates one for you.

In this case, you are out of luck. ES can only identify that one doc
is a different version of another doc by looking at the ID.

Instead, you would have to have some way of identifying which docs are
current, and which are old, and you can use the delete_by_query API to
mark the old ones as deleted.
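
One possible way of "identifying which docs are current" is sketched below in Python. It assumes each doc carries a logical key shared by all its versions and a timestamp; the field names "key" and "updated" and the function stale_ids are hypothetical, and the sketch only picks out the old docs, it does not call ES.

```python
def stale_ids(docs):
    """Return the IDs of every doc that is not the newest one for its key.

    docs is a list of dicts with the hypothetical fields
    'id' (the ES document ID), 'key' (shared by all versions of the
    same logical doc) and 'updated' (any comparable timestamp).
    """
    newest = {}
    for doc in docs:
        best = newest.get(doc["key"])
        if best is None or doc["updated"] > best["updated"]:
            newest[doc["key"]] = doc
    current = {doc["id"] for doc in newest.values()}
    return sorted(doc["id"] for doc in docs if doc["id"] not in current)
```

The returned IDs, or an equivalent query on a flag you maintain yourself, are what you would then hand to a delete request.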

  3. You have a rolling window

For instance, you're indexing log statements, and you want to have the
last week's data available to you, but automatically clear out anything
older.

The easiest thing here would be to create a new index every day, e.g.
"logs_2011-06-23", and only insert into today's index.

For querying, you could create an alias, e.g. "logs_week", which points to
"logs_2011-06-23", "logs_2011-06-22", etc.

Then each day you would create a new index, update the alias, and delete
the index for $today-8 days.
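
The daily routine can be sketched as a small Python helper that works out the names involved. It only builds the plan and does not talk to ES. The function names, the "logs_" prefix handling and the 7-day alias window are assumptions for illustration; the alias body uses the add/remove actions format of the aliases API.

```python
from datetime import date, timedelta


def index_name(day, prefix="logs_"):
    """Name of the daily index for a given date, e.g. logs_2011-06-23."""
    return f"{prefix}{day.isoformat()}"


def rollover_plan(today, alias="logs_week", keep_days=7, drop_after_days=8):
    """Work out what to create, re-alias and delete for one day.

    keep_days is an assumed alias window; drop_after_days follows the
    "$today-8 days" suggestion above.
    """
    newest = index_name(today)
    leaving = index_name(today - timedelta(days=keep_days))
    return {
        "create": newest,
        # Body for an alias update: add today's index, drop the oldest one.
        "alias_actions": {
            "actions": [
                {"add": {"index": newest, "alias": alias}},
                {"remove": {"index": leaving, "alias": alias}},
            ]
        },
        "delete": index_name(today - timedelta(days=drop_after_days)),
    }
```

For 2011-06-23 this plans to create "logs_2011-06-23", take "logs_2011-06-16" out of the alias, and delete "logs_2011-06-15".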

clint


(Hari Shankar) #3

Hi Clint,

Great. I am basically case 1, but I saw that even though I don't specify
any ID in the indexing request, ES automatically creates a new ID, e.g.,
the version is increased to 2. But even after this, I saw that version 1 can
be accessed. So do you mean that it won't always be accessible, i.e.,
version 1 will be visible for some time but will be removed automatically
after a while?

Hari



(Clinton Gormley) #4

Hi Hari

> Great.. I am basically case 1, but I saw that even though I don't
> specify any ID in the indexing request, es automatically creates a new
> id, e.g, version is increased to 2. But even after this, I saw that
> version 1 can be accessed. So you mean to say that it won't be always
> accessible, i.e, the version 1 will be visible for some time but will
> get removed automatically after some time?

Hold on, you DON'T specify an ID, but the _version gets incremented?
That doesn't make much sense:

curl -XPOST 'http://127.0.0.1:9200/foo/bar?pretty=1' -d '{"text" :"bar"}'

[Thu Jun 23 11:17:30 2011] Response:

{
   "ok" : true,
   "_index" : "foo",
   "_id" : "rXlA07w5TN2dlE8jjH-LRg",
   "_type" : "bar",
   "_version" : 1
}

curl -XPOST 'http://127.0.0.1:9200/foo/bar?pretty=1' -d '{"text" :"bar"}'

[Thu Jun 23 11:17:33 2011] Response:

{
   "ok" : true,
   "_index" : "foo",
   "_id" : "pywWEDvqQoS_Oqmf509lkQ",
   "_type" : "bar",
   "_version" : 1
}

And if you DO specify the same ID, the older version should become
hidden (although it may take up to 1 second for that to happen).

If you're seeing something different, please provide a gist to
demonstrate.

ta

clint


(Hari Shankar) #5

Sorry, I meant even if I don't specify a version. I put something with
id=id1, then I put something else with id1. Now when I do a GET, I get the
latest version of the doc, which is fine. I just want to know if there is
any way I can search/query the older version with the same ID. Likewise, is
there a way I can get back a specific version of the doc?

Hari



(Clinton Gormley) #6

On Thu, 2011-06-23 at 15:53 +0530, Hari Shankar wrote:

> Sorry, I meant even if I don't specify a version. I put something with
> id=id1, then I put something else with id1. Now when I do a GET, I get
> the latest version of the doc, which is fine. I just want to know if
> there is any way I can search/query the older version with the same
> id? Likewise, is there a way I can get back a specific version of the
> doc?

I think the answer is: not without circumventing the API. You should
only be able to get the latest version back.

clint


(Shay Banon) #7

There is no way to get previous versions of the doc unless you store several versions of the doc yourself under different ids / types.
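
Storing versions yourself under different IDs could look roughly like the Python sketch below. A plain dict stands in for the index, and the "<id>_v<version>" naming scheme and the function names are assumptions for illustration only, not anything ES provides.

```python
def versioned_id(doc_id, version):
    """Build a distinct ID per version, e.g. id1_v2 (hypothetical scheme)."""
    return f"{doc_id}_v{version}"


def put_version(store, doc_id, version, source):
    """Keep every version under its own ID so none overwrites another."""
    store[versioned_id(doc_id, version)] = source


def get_version(store, doc_id, version):
    """Fetch one specific version, or None if it was never stored."""
    return store.get(versioned_id(doc_id, version))
```

Because each version has its own ID, none of them is hidden by a later one, so deleting the old ones is then up to you.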


