The _source compression flag does not seem to be working.
ElasticSearch version: 0.19.1
Gateway: Hadoop
Hadoop version: 0.20.2.cdh3u3
We are planning to use ElasticSearch with the Hadoop gateway to store and search log data. There will be a few hundred terabytes of it.
TEST1:
Index: order
INDEX STATUS
{
  state: open
  settings: {
    index.number_of_shards: 10
    index.number_of_replicas: 0
    index.version.created: 190199
    index._source.compress: true
  }
}
The compression flag was set by running this curl request:
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
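Side question: since _source looks like a mapping-level field, should the flag be set through the put-mapping API instead? Something like this sketch, assuming the type name trade:

curl -XPUT 'localhost:9200/order/trade/_mapping' -d '{
  "trade": {
    "_source": { "compress": true }
  }
}'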
Then I ran the optimize API:
curl -XPOST 'http://localhost:9200/order/_optimize?max_num_segments=5'
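For reference, I believe the same size numbers shown below can also be read over HTTP with the index status API:

curl 'localhost:9200/order/_status?pretty=true'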
order
size: 1.2gb (1.2gb)
docs: 349345 (349345)
Each log message is about 1K.
To check how the compression behaves, I also wrote the raw data to a flat file:
Raw data: 381978518 Apr 20 12:40 test_es.log (uncompressed, ~380M)
Data written to HDFS: 1304717764 bytes (~1.2G)
The raw data is only about 380M, so how has ElasticSearch written about 1.2G even with the compress flag enabled? That is roughly 3.4 times the raw size (1304717764 / 381978518 ≈ 3.4). I do not see the data being compressed at all. ElasticSearch does write a few fields in addition to the log message fields, but how can its data end up three times the size of the original? Am I missing anything here?
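As a rough baseline for how compressible the raw data is, the flat file can be gzipped and the sizes compared (just a sanity check; I am not quoting its output here):

gzip -c test_es.log | wc -c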
Each document contains:
{
  _index: order
  _type: trade
  _id: 5c10PkonSeOQrBZkH8JVpg
  _version: 1
  _score: 1
  _source: {
    // the source contains the log message: 12 fields, about 1K in total
  }
}
TEST2:
I tested both with and without compression.
Cluster: 3-node cluster
Replication: 2
Number of shards: 10
Gateway: hadoop (5-node cluster, replication 3)
With Compression
INDEX META DATA
{
  state: open
  settings: {
    index.number_of_shards: 10
    index.number_of_replicas: 2
    index.version.created: 190199
    index._source.compress: true
  }
}
order
size: 2.2gb (6.7gb)
docs: 627336 (627336)
The same log data written to a flat file:
685678405 Apr 20 10:44 test_es.log (~685M)
So the raw data is only about 685M. (I read "2.2gb (6.7gb)" as 2.2gb of primary data and 6.7gb in total, i.e. three copies with the 2 replicas.)
Without Compression
order
size: 2.2gb (6.8gb)
docs: 631361 (631361)
690077730 Apr 20 12:45 test_es.log (~690M)
Running "du -sh" on the data directory, I see no difference in data size with versus without compression.
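For reference, this is the check I mean; the path assumes the default data layout under the ElasticSearch home, so adjust it for your install:

du -sh $ES_HOME/data/*/nodes/*/indices/order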
Is this the right way to set the compression flag?
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress_threshold" : 10000} }}'
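Or does compression have to be enabled in the mapping when the index is created, before any documents are indexed? Something like this sketch (again assuming the type name trade):

curl -XPUT 'localhost:9200/order' -d '{
  "mappings": {
    "trade": {
      "_source": { "compress": true }
    }
  }
}'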
I would really appreciate any help with this.
Please also share some guidelines on how to scale to large data volumes.