Question on Index Size

Hi,

I was trying to understand how much space an index takes for a given amount
of input data.

Below is the scenario and observation:

  1. For this test I picked a 10-column CSV with 1000 rows. The size of the
    CSV is 111 KB.
  2. I created 2 fields (of type string) for each column: one analyzed to
    run searches and one not_analyzed to run facets.
  3. The index was configured to create 5 shards.
  4. After indexing I found that the size of the index was 4.5 MB. (This
    includes all 5 shards, the translog, etc.)
  5. That means it is almost 45 times the original size, which is a very
    significant increase.
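The setup in step 2 can be sketched as a mapping roughly like the one below (the index type `row` and column names are hypothetical; `multi_field` is the 0.19-era way to index one value both analyzed and not_analyzed):

```json
{
  "mappings": {
    "row": {
      "properties": {
        "col1": {
          "type": "multi_field",
          "fields": {
            "col1":      { "type": "string", "index": "analyzed" },
            "untouched": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```

Searches would go against `col1` and facets against `col1.untouched`. Note that with 10 columns this doubles the number of indexed fields per document, which by itself contributes to the size increase.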

Then I tried not to include _source and found that the index size reduced
by 25%, down to 3 MB, which is still significant.

Am I missing something? Or are there any other ways to reduce the size of
the indices?

Thanks
Rahul

Store-level compression is one way to compress the data (available in 0.19.8). Note that CSV by itself is quite different from the JSON format.
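For reference, store-level compression would be enabled at index-creation time with a setting roughly like the one below (the exact setting name from the 0.19.x line is an assumption here):

```json
{
  "settings": {
    "index.store.compress.stored": true
  }
}
```

This compresses the stored-fields file of each segment, which also covers _source, since _source is itself a stored field.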

On Aug 5, 2012, at 2:12 PM, Rahul Sharma rahul.sharma.coder@gmail.com wrote:


I am on 0.19.3.

I tried the following things, which helped:

  1. Reduced the size of the field keys.
  2. Enabled _source compression.

This reduced the size of the index by half, but in my context it is still
very big.
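The _source compression mentioned in step 2 would be set in the mapping, roughly like this (the type name `row` is hypothetical):

```json
{
  "mappings": {
    "row": {
      "_source": { "compress": true }
    }
  }
}
```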

  1. I tried removing _source. This reduced the size of the index by a
    further 50%. But if I understand correctly, returning the original
    document in search results will stop working, since that relies on
    _source.
  2. Since I store each value as a field for faceting, I am wondering if
    there is a way to reconstruct the document from the fields I index as
    not_analyzed.
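One sketch of the idea in step 2, under the assumption that each column is explicitly stored (type and field names hypothetical):

```json
{
  "mappings": {
    "row": {
      "_source": { "enabled": false },
      "properties": {
        "col1": { "type": "string", "index": "not_analyzed", "store": "yes" }
      }
    }
  }
}
```

With _source disabled, a search request could then ask for the stored values with `"fields": ["col1", ...]` and the document could be reassembled client-side from those values. Note that explicitly storing every field adds back some of the space that dropping _source saved.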

Thanks
Rahul

On Wed, Aug 8, 2012 at 3:30 AM, Shay Banon kimchy@gmail.com wrote:


When you say store-level compression, do you mean compression only applies
to fields marked field("store", "yes"), or does it work otherwise as well?
Is it in addition to _source compression?

Does it impact faceting performance?
