Compression flag is not working


(Anita) #1

Compression flag is not working.

ElasticSearch version : 0.19.1
Gateway : Hadoop
Hadoop version: 0.20.2.cdh3u3

We are planning to use ElasticSearch with the Hadoop gateway to store and search log data.
There will be a few hundred terabytes of log data.

TEST1:
Index: order

INDEX STATUS
{
state: open
settings: {
index.number_of_shards: 10
index.number_of_replicas: 0
index.version.created: 190199
index._source.compress: true
}
}
The compression flag was set with this curl request:
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
Then I ran the optimize API:
curl -XPOST 'http://localhost:9200/order/_optimize?max_num_segments=5'

Order
size: 1.2gb (1.2gb)
docs: 349345 (349345)

Each log message is about 1K.

I also wrote the raw data to a file to check how compression works:
Raw data : 381978518 Apr 20 12:40 test_es.log ( Non compressed 380M)

Data written on the hdfs : 1304717764 bytes (1.2G)

The raw data is about 380M, so how did ElasticSearch write about 1.2G of data even with the compress flag enabled?
ElasticSearch wrote about 3 times the raw data size.
I do not see the data getting compressed at all. ElasticSearch does write a few extra fields in addition to the log message fields, but how can its data size be 3 times the original? Am I missing anything here?
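As a rough sanity check (a Python sketch with a made-up log line standing in for our actual data, using hypothetical field names), repetitive JSON log text like ours should compress to a fraction of its raw size:

```python
import json
import zlib

# A made-up ~1K document with 12 fields, standing in for a real log message.
doc = {"field%d" % i: "some repetitive log payload text " * 3 for i in range(12)}
raw = json.dumps(doc).encode("utf-8")

# Deflate the serialized JSON, as _source compression would.
compressed = zlib.compress(raw)
ratio = len(compressed) / float(len(raw))
print("raw=%d bytes, compressed=%d bytes, ratio=%.2f" % (len(raw), len(compressed), ratio))
```

So I would expect the stored size to shrink noticeably, not grow.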

Each document contains

_index: order
_type: trade
_id: 5c10PkonSeOQrBZkH8JVpg
_version: 1
_score: 1
_source: {

 // _source contains the log message. It has 12 fields; each log message is about 1K.

}

TEST2:
I tested with and without compression.

Cluster : 3 node cluster
Replication : 2
Number of shards : 10
Gateway : hadoop (5 node cluster, replication 3)

Compression Test

INDEX META DATA
{
state: open
settings: {
index.number_of_shards: 10
index.number_of_replicas: 2
index.version.created: 190199
index._source.compress: true
}
}

order
size: 2.2gb (6.7gb)
docs: 627336 (627336)

Writing the same log data to a file:

685678405 Apr 20 10:44 test_es.log ( size 685M)

Raw data is only about 685M

Without Compression
order
size: 2.2gb (6.8gb)
docs: 631361 (631361)

690077730 Apr 20 12:45 test_es.log (SIZE 690M)

I do not see any difference in data size with or without compression when I run "du -sh" on the data directory.
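If I am reading the status output correctly, the second number in parentheses may be the total across all copies, which would line up with the primary size times (1 + replicas). A quick arithmetic sketch, assuming that interpretation:

```python
primary_gb = 2.2   # size reported for the order index primaries
replicas = 2       # index.number_of_replicas

# Each replica is a full copy of the primaries.
total_gb = primary_gb * (1 + replicas)
print("expected total: %.1f gb" % total_gb)
```

That would also match TEST1, where replicas were 0 and both numbers were 1.2gb.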

Is this the right way to set the compression flag?
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress_threshold" : 10000} }}'

I really appreciate any help on this.

Please give me some guidelines on how to scale to large data volumes.


(Michael Sick) #2

Do you have any fields included in the _source? Compression does not apply
to the index overall, only to the fields stored in the _source field.

http://www.elasticsearch.org/guide/reference/mapping/source-field.html
The _source field is an automatically generated field that stores the
actual JSON that was used as the indexed document.

...

Includes / Excludes

Allow to specify paths in the source that would be included / excluded when
its stored, supporting * as wildcard annotation.
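For example (a sketch only, with hypothetical field paths), the includes/excludes paths go in the type mapping like this:

```shell
# Hypothetical example: keep only message fields in _source, drop debug fields
curl -XPUT 'http://localhost:9200/order/trade/_mapping' -d '{
  "trade" : {
    "_source" : {
      "includes" : ["message.*"],
      "excludes" : ["debug.*"]
    }
  }
}'
```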

On Fri, Apr 20, 2012 at 6:49 PM, Anita narayana_anita@hotmail.com wrote:



(Igor Motov) #3

_source is a field, so it should be configured in the mapping, like this:

curl -XPUT http://localhost:9200/order/trade/_mapping -d '{
  "trade" : {
    "_source" : { "compress" : true }
  }
}'
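After applying it, you can check that the flag took effect by fetching the mapping back (assuming the default host and port):

```shell
curl -XGET 'http://localhost:9200/order/trade/_mapping?pretty'
```

The response should show "_source" : { "compress" : true } under the trade type. Note that compression only applies to documents indexed after the mapping change.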



(system) #4