Shard size

I've been trying to optimize ES for logstash (threads, indexing memory, number of shards, compression, etc.) and I'm stuck on disk usage. As a test I indexed 4GB of IIS logs into a two-shard index on a two-node cluster. Each shard has one replica. I'm seeing that each shard is 4GB and each replica is 4GB (fine, that's a given, it's a replica).

I'm really confused as to why each shard is 4GB. Shouldn't both shards together equate to 4GB, not 8GB? At this point, 4GB of logs is equating to 16GB of used storage across two nodes.

routing: {
state: STARTED
primary: true
node: eF-H3zhSTI6piq_f6ukjtA
relocating_node: null
shard: 0
index: logstash-2013.05.27
}
state: STARTED
index: {
size: 3.8gb
size_in_bytes: 4157721500
}

routing: {
state: STARTED
primary: true
node: eF-H3zhSTI6piq_f6ukjtA
relocating_node: null
shard: 1
index: logstash-2013.05.27
}
state: STARTED
index: {
size: 3.9gb
size_in_bytes: 4191907329
}

-confused

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi pleitao,

You said that you indexed 4GB of logs and that the indices take up 16GB on
disk. Since you have 1 replica, every primary shard also has a replica
copy, so your primary shards occupy about 8GB and your replicas another
8GB. The output that you've pasted shows two shards that are both
primaries, so it doesn't include the replicas: 3.8GB + 3.9GB is roughly
8GB, which makes sense.
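
The arithmetic can be checked against the size_in_bytes values in the pasted output. A minimal sketch in Python, assuming one replica per primary (as stated in the original post) and assuming each replica is the same size as its primary, which is only approximately true in practice. The variable names are illustrative.

```python
# Primary shard sizes, copied from the size_in_bytes fields in the pasted output.
primary_sizes = {0: 4157721500, 1: 4191907329}
replicas_per_primary = 1  # from the original post: each shard has one replica

GIB = 1024 ** 3

primary_total = sum(primary_sizes.values())
# Every primary is stored once, plus once more per replica
# (assumption: a replica takes the same space as its primary).
cluster_total = primary_total * (1 + replicas_per_primary)
shard_copies = len(primary_sizes) * (1 + replicas_per_primary)

print(f"shard copies:   {shard_copies}")
print(f"primaries only: {primary_total / GIB:.2f} GiB")
print(f"with replicas:  {cluster_total / GIB:.2f} GiB")
```

The two primaries alone come to a little under 8GB, and the replicas double that to a little under 16GB.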

Hope it helps.

(In reply to pleitao@gmail.com's message of Mon, Jun 3, 2013 at 6:01 AM)

--
Regards,
Abhijeet Rastogi (shadyabhi)
http://blog.abhijeetr.com

Thanks Abhijeet,
Shouldn't shard 0 and shard 1 (the primary shards) together add up to 4GB? I thought data was spread out across all shards, not duplicated on all shards.

Thanks.

As you said: At this point, 4GB of logs is equating to 16GB of used storage
across two nodes.

This makes perfect sense. With 2 primary shards and 1 replica configured,
you actually have 4 shards.

Now, as 4GB converts to 16GB, this 16GB is actually 8GB + 8GB (the
primaries plus one replica of each). And as you have 2 primary shards, each
8GB is actually 4GB + 4GB.

I'm not sure where I'm not making sense. Also, there is no duplication of
data among primary shards.
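
To illustrate why primaries don't duplicate each other: each document is routed to exactly one primary shard, chosen from a hash of its routing value (the document ID by default) modulo the number of primary shards. A rough sketch in Python follows. Here zlib.crc32 is only a stand-in for Elasticsearch's internal hash function, and the document IDs and the pick_shard helper are made up for illustration.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 2  # as in the index discussed in this thread


def pick_shard(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Map a document ID to exactly one primary shard.

    zlib.crc32 is a stand-in hash (an assumption for illustration);
    Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Hypothetical document IDs, not taken from the real index.
doc_ids = [f"log-{n}" for n in range(10_000)]
per_shard = Counter(pick_shard(doc_id) for doc_id in doc_ids)

for shard, count in sorted(per_shard.items()):
    print(f"shard {shard}: {count} documents")
```

Each ID lands on one shard only, so the per-shard counts add up to the number of documents rather than to a multiple of it.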

(In reply to pleitao@gmail.com's message of Mon, Jun 3, 2013 at 4:16 PM)

--
Regards,
Abhijeet Rastogi (shadyabhi)
http://blog.abhijeetr.com

I think what the OP is asking is why 4GB of raw data ends up being 8GB
(non-replicated).

Besides the size of the raw data, you need to account for the size of the
field indexes as well. You can reduce the size by setting fields that are
never queried or filtered on to be not indexed, and by not storing fields.
Fields are not stored by default, but they are indexed by default. Also be
sure that numeric fields are indexed as numerics and not strings. Use
smaller numeric types whenever possible (ints/shorts over longs, float over
double), but the biggest savings come from not indexing string content.
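
As a rough sketch of what such a mapping could look like, here is a Python dict that serializes to the JSON mapping body. The type name (iis) and the field names (sc_status, time_taken, cs_user_agent) are hypothetical. The option values ("index": "no", "store": "no", the short and integer types) follow the mapping syntax of Elasticsearch releases from around the time of this thread. Later versions changed some of these, so treat them as assumptions and check the documentation for your version.

```python
import json

# Hypothetical mapping for an IIS log type; all names are illustrative only.
mapping = {
    "iis": {
        "properties": {
            # Numeric fields mapped as small numeric types rather than strings.
            "sc_status": {"type": "short"},
            "time_taken": {"type": "integer"},
            # Never queried or filtered on: not indexed and not stored as a
            # separate field (the value is still part of _source).
            "cs_user_agent": {"type": "string", "index": "no", "store": "no"},
        }
    }
}

# Serialize to the JSON body that would be sent when defining the mapping.
print(json.dumps(mapping, indent=2))
```

For daily logstash indices, a mapping like this would normally go into an index template so that each new index picks it up.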

Another source of potential savings is to omit term frequencies and/or
norms. Read some Lucene documentation to understand what they do.
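
For context, norms are per-field scoring factors (length normalization and index-time boosts), and term frequencies record how often a term occurs in each document. Both mainly serve relevance scoring, which log searches often don't need. A sketch of a field mapping that drops them, again with a hypothetical field name. The option names (omit_norms, index_options) are given as they appeared in Elasticsearch mappings of this era and should be verified against your version.

```python
import json

# Hypothetical string field that is filtered on but never scored.
field_mapping = {
    "cs_uri_query": {
        "type": "string",
        # Skip norms (length normalization / index-time boost data).
        "omit_norms": True,
        # Index document IDs only: no term frequencies and no positions.
        # Without positions, phrase queries on this field will not work.
        "index_options": "docs",
    }
}

print(json.dumps(field_mapping, indent=2))
```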

--
Ivan

(In reply to Abhijeet Rastogi's message of Mon, Jun 3, 2013 at 5:06 AM)