Index size explosion (17 GB -> 840 GB)

Hi,

I had an index that was initially 17 GB. It had a single shard and ran on a
single machine.
Last week I migrated the data to a 4-shard index with the same mapping.
These 4 shards are distributed among 4 machines.
After the migration, I ran a script that updates one of the fields of every
feed.
Now, after all the migration and updating, the index is around 840 GB in
total, with each shard holding around 250 GB of data.

I am not able to comprehend what happened.
Kindly shed some light on this issue.

Thanks
Vineeth


Can you give us more insight into what your script does?


Adding some more info.
The number of replicas is 0.

Also, please find the before and after images from the head plugin attached.
Kindly note that the number of feeds is the same for both.
Also, the only operation I did on the index after migration was update
requests on all the feeds.

Thanks
Vineeth


Adding attachments.

PFA

Thanks
Vineeth


File size dumps of the previous and present indices:

Previous - https://gist.github.com/Vineeth-Mohan/5237294 (single shard,
single machine)
Present - https://gist.github.com/Vineeth-Mohan/5237236 (4 shards, 4
machines)


I would guess that the reason for this change in shard sizes is a dramatic
increase in the average size of the _source field of your documents. I
would suggest comparing the sources of documents before and after the update
to see what changed. Maybe the update process didn't go as expected.
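
For example, a quick spot check on a few documents (a minimal sketch in
Python; the index names, type, and IDs below are hypothetical, adjust them
to yours):

import json
import requests

ES = "http://localhost:9200"
OLD_INDEX = "feeds_v1"    # hypothetical: the original single-shard index
NEW_INDEX = "feeds_v2"    # hypothetical: the migrated 4-shard index

def source_size(index, doc_type, doc_id):
    # Fetch a document and return the byte length of its serialized _source.
    resp = requests.get("%s/%s/%s/%s" % (ES, index, doc_type, doc_id))
    resp.raise_for_status()
    return len(json.dumps(resp.json()["_source"]))

for doc_id in ("1", "2", "3"):    # sample a few feed IDs
    print("doc %s: %d bytes -> %d bytes" % (doc_id,
          source_size(OLD_INDEX, "feed", doc_id),
          source_size(NEW_INDEX, "feed", doc_id)))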


Not saying this happened to you, but I've had bugs in update scripts before
that recursively include the source in the update. So when a document is
updated, I accidentally include the old _source as a field. The next time
the doc is updated, the _source is included again (and now contains two
copies of the old source), and so on.

If you do that enough, the size quickly spirals out of control. Definitely
check your script to make sure it is doing what you think it is.
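
For illustration, a minimal Python sketch of that bug pattern (hypothetical
index and field names, not your actual script):

import json
import requests

ES = "http://localhost:9200"

# BUG: `hit` is the whole GET response ({"_index": ..., "_source": {...}}),
# not the source body itself.
hit = requests.get(ES + "/feeds/feed/1").json()
hit["status"] = "processed"

# Re-indexing the full response stores the old _source as a field of the
# new document; every later pass nests another full copy inside it.
requests.put(ES + "/feeds/feed/1", data=json.dumps(hit))

# Correct: modify and re-index only the source body.
src = requests.get(ES + "/feeds/feed/1").json()["_source"]
src["status"] = "processed"
requests.put(ES + "/feeds/feed/1", data=json.dumps(src))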

-Zach


@Zach - Yeah dude, that is what happened to me as well.

I expected the update script to replace a specific field, like it does for a
single-value field. But if the field is a complex structure of hashes and
arrays, the update actually merges the new value with the old one. Hence
the size explosion.
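
For anyone who hits this later, a minimal Python sketch of the merge
behaviour (hypothetical index and field names; update-API syntax as it was
circa 0.90):

import json
import requests

ES = "http://localhost:9200"

# Seed a document whose "meta" field is a nested object.
requests.put(ES + "/feeds/feed/1",
             data=json.dumps({"title": "t", "meta": {"a": 1, "b": 2}}))

# Partial-doc update: I expected "meta" to become just {"c": 3} ...
requests.post(ES + "/feeds/feed/1/_update",
              data=json.dumps({"doc": {"meta": {"c": 3}}}))

# ... but nested objects are merged, so "meta" is now {"a": 1, "b": 2, "c": 3}.
print(requests.get(ES + "/feeds/feed/1").json()["_source"]["meta"])

# To replace the field wholesale, assign it in an update script instead:
requests.post(ES + "/feeds/feed/1/_update",
              data=json.dumps({"script": "ctx._source.meta = newval",
                               "params": {"newval": {"c": 3}}}))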

Thanks @igor for all your help.

Thanks
Vineeth
