I had an index which was initially 17 GB. It had a single shard and ran
on a single machine.
Last week I migrated the data to a 4-shard index with the same mapping.
These 4 shards are distributed among 4 machines.
After the migration, I ran a script which updates one of the fields of
every feed.
Now, after all the migration and updates, the size of the index is around
840 GB in total, with each shard holding around 250 GB of data.
I am not able to comprehend what happened.
Kindly shed some light on this issue.
Can you give us more insight into what your script does?
Adding some more info.
The number of replicas is 0.
Also, please find the before and after images of the head plugin attached.
Kindly note that the number of feeds is the same for both.
Also, the only operation I did on the index after migration was update
requests on all the feeds.
I would guess that the reason for this change in shard sizes is a dramatic
increase in the average size of the _source field of your documents. I
would suggest comparing the sources of a few documents before and after the
update to see what changed. Maybe the update process didn't go as expected.
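For comparing concretely, here is a quick sketch using only the Python
standard library. The host, index, type, and document id are hypothetical
placeholders; adjust them to your cluster.

    import json
    import urllib.request

    ES = "http://localhost:9200"  # assumed host/port

    def get_source(index, doc_type, doc_id):
        # Fetch one document and return its stored _source.
        url = f"{ES}/{index}/{doc_type}/{doc_id}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["_source"]

    # Compare the serialized size of the same feed in the old and new
    # index (index/type/id names here are made up).
    old = get_source("feeds_old", "feed", "42")
    new = get_source("feeds", "feed", "42")
    print("old _source:", len(json.dumps(old)), "bytes")
    print("new _source:", len(json.dumps(new)), "bytes")

If the new document is many times larger, diffing the two JSON bodies
should show exactly which field grew.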
Not saying this happened to you, but I've had bugs in update scripts before
that recursively include the source in the update. So when a document is
updated, I accidentally include the old _source as a field. The next time
the doc is updated, the _source is again included (which now includes two
copies of the old source), etc etc.
If you do that enough, the size quickly spirals out of control. Definitely
check your script to make sure it is doing what you think it is.
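Here is a minimal sketch of that failure mode, with plain Python dicts
standing in for documents (no cluster needed; the field names are made up):

    import json

    doc = {"title": "feed-1", "body": "x" * 100}
    print("initial:", len(json.dumps(doc)), "bytes")

    # Each buggy "update" re-sends the field values but also nests the
    # previous _source inside the new document, so the stored size
    # roughly doubles on every pass.
    for i in range(5):
        doc = {"title": "feed-1", "body": "x" * 100, "old_source": doc}
        print(f"after update {i + 1}:", len(json.dumps(doc)), "bytes")

A few dozen passes of that and a 17 GB index can easily end up hundreds of
gigabytes.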
-Zach
@Zach - Yeah dude, that is what happened to me also.
I expected the update script to replace a specific field, the way it does
for a single-value field. But if the field is a complex structure of hashes
and arrays, the update actually merges the new value with the old one.
Hence the size explosion.
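For anyone hitting the same thing, a minimal sketch of that merge
behaviour in plain Python (a simplified stand-in for the real merge logic,
with made-up field names): object-valued fields are merged key by key
rather than replaced, so keys only ever accumulate.

    def merge(old, new):
        # Recursively merge `new` into `old`, the way a partial update
        # treats object-valued fields; scalar values are overwritten.
        out = dict(old)
        for key, val in new.items():
            if isinstance(val, dict) and isinstance(out.get(key), dict):
                out[key] = merge(out[key], val)  # nested objects merge...
            else:
                out[key] = val                   # ...scalars replace
        return out

    old = {"stats": {"2013-03-24": 10, "2013-03-25": 12}}
    new = {"stats": {"2013-03-26": 7}}

    # All three date keys survive the "update"; nothing is ever removed,
    # so repeated updates can only grow the document.
    print(merge(old, new))

To genuinely replace a field like this, reindex the whole document, or use
a scripted update that assigns the field outright (for example, setting
ctx._source.stats to the new value; "stats" is a hypothetical field name
here).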