Compress : true flag


(Anurag Phadke) #1

Added the compress flag using the following:

curl -XPUT localhost:9200/_settings -d '{
  "index" : {
    "_source" : {"compress" : true}
  }
}'

However, there wasn't any significant difference in size: approx.
320 GB/node before and 312 GB/node after enabling compression, on indexing
10M documents.
Any idea what might be wrong here?

-anurag


(Shay Banon) #2

The flag should be set on the mapping of a type, not in the index settings:
http://www.elasticsearch.org/guide/reference/mapping/source-field.html.
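As a minimal sketch of the per-type put-mapping call on a 0.17-era cluster (myindex and mytype are placeholder names):

```shell
# Set _source compression on the mapping of a specific type,
# rather than in the index settings (placeholder index/type names).
curl -XPUT 'localhost:9200/myindex/mytype/_mapping' -d '{
  "mytype" : {
    "_source" : { "compress" : true }
  }
}'
```

Note that compression applies at index time, so only documents indexed (or segments rewritten) after the mapping change will be stored compressed.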



(Anurag Phadke) #3

Shay,
That worked. I then tried to save some more disk space by running:
curl -XPOST 'http://localhost:9200/_optimize?max_num_segments=5'

However, this actually added about 1 GB of data to all the nodes.
Probably over-optimization on my end?

-anurag



(Shay Banon) #4

That's strange that the data was not reduced. Anything in the logs? Can
you try issuing a flush to the index and see if it gets smaller? (There
might still be open indexing "readers" against those old index files.)



(Shay Banon) #5

Transaction logs are not only removed on restart. They are removed on a
flush (where a Lucene commit is executed and a new transaction log is
created). You can issue an optimize with the flush flag set to true, which
will flush post-optimization (assuming you wait for the merge).

Once optimized and "sync" returns, are we sure the data is persisted?

What is "sync"? Data is persisted once it has been indexed; it has nothing
to do with optimization or flushing.
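A sketch of that optimize call against a 0.17-era cluster (myindex is a placeholder; per the docs, flush and wait_for_merge default to true, but they are spelled out here for clarity):

```shell
# Merge down to a single segment, wait for the merge to finish, then
# flush, which performs a Lucene commit and rolls a new transaction log.
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=1&wait_for_merge=true&flush=true'
```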

On Fri, Jul 29, 2011 at 1:20 PM, Olivier Favre olivier@yakaz.com wrote:

I saw that the transaction logs only get removed when restarting ES.
If I remember correctly, they were deleted after an _optimize before 0.17.2.

Once optimized and "sync" returns, are we sure the data is persisted?
If so, could we flush the old transaction logs?

--
Olivier Favre

www.yakaz.com



(Shay Banon) #6

Yeah, it seems like a problem; I opened an issue for it:
https://github.com/elasticsearch/elasticsearch/issues/1180.

I still don't understand the sync question. Are you referring to when files
are fsync'ed? If so, that depends: Lucene fsyncs files on commit, and
elasticsearch fsyncs the transaction log on changes.

On Fri, Jul 29, 2011 at 4:50 PM, Olivier Favre olivier@yakaz.com wrote:

I issued
curl -XPOST ..../index/_optimize
which didn't remove the transaction logs.
(I even think I issued _flush and _refresh manually.)
And according to the docs
(http://www.elasticsearch.org/guide/reference/api/admin-indices-optimize.html),
the default values for refresh, flush and wait_for_merge are true.

By "sync", I meant the Unix command.

--
Olivier Favre

www.yakaz.com



(Anurag Phadke) #7

Shay,
I called the flush API using:

curl -XPOST 'http://localhost:9200/_flush'
It returned quickly with status "ok" but didn't change the size.

-anurag



(Shay Banon) #8

I think what you're seeing is the problem in 0.17.2 that Olivier found,
where transaction logs don't get removed on flush. It's already fixed in
the 0.17 branch and will be part of the upcoming 0.17.3 (released this week).



(Anita) #9

The compress flag does not seem to be working.

ElasticSearch version: 0.19.1
Gateway: Hadoop
Hadoop version: 0.20.2.cdh3u3

We are planning to use ElasticSearch with the Hadoop gateway to store and search log data.
There will be a few hundred terabytes of log data.

TEST1:
Index: order

INDEX STATUS
{
  state: open
  settings: {
    index.number_of_shards: 10
    index.number_of_replicas: 0
    index.version.created: 190199
    index._source.compress: true
  }
}
The compression flag was set by running:
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'

Then I ran the optimizer:
curl -XPOST 'http://localhost:9200/order/_optimize?max_num_segments=5'

Order
size: 1.2gb (1.2gb)
docs: 349345 (349345)

Each log message is about 1K.

I am writing the raw data to a file to check how compression works.
Raw data: 381978518 bytes, Apr 20 12:40, test_es.log (uncompressed, about 380 MB)

Data written to HDFS: 1304717764 bytes (1.2 GB)

The raw data is only about 380 MB, so how come elasticsearch has written about 1.2 GB even with the compress flag enabled?
ElasticSearch wrote about 3 times the raw data.
I do not see the data getting compressed at all. In addition to the log message fields, elasticsearch writes a few more fields, but how can the elasticsearch data be 3 times the original size? Am I missing anything here?

Each document contains

_index: order
_type: trade
_id: 5c10PkonSeOQrBZkH8JVpg
_version: 1
_score: 1
_source: {
  // source contains the log message. It has 12 fields. The log message is about 1K.
}

TEST2:
I tested with/without compression.

Cluster : 3 node cluster
Replication : 2
Number of shards : 10
Gateway : hadoop (5 node cluster, replication 3)

Compression Test

INDEX META DATA
{
  state: open
  settings: {
    index.number_of_shards: 10
    index.number_of_replicas: 2
    index.version.created: 190199
    index._source.compress: true
  }
}

order
size: 2.2gb (6.7gb)
docs: 627336 (627336)

Writing the same log data to a file:

685678405 Apr 20 10:44 test_es.log (size 685 MB)

The raw data is only about 685 MB.

Without Compression

order
size: 2.2gb (6.8gb)
docs: 631361 (631361)

690077730 Apr 20 12:45 test_es.log (SIZE 690M)

I do not see any difference in data size with or without compression when I run "du -sh" on the data directory.

Is this the right way to set the compression flag?
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress_threshold" : 10000} }}'

I really appreciate any help on this.

Please give me some guidelines on how to scale to large data volumes.

