Question on Index Size


(Rahul Sharma) #1

Hi,

I was trying to understand how much space an index takes for a given amount
of input data.

Below is the scenario and observation:

  1. For the test I picked a 10-column CSV with 1000 rows. The size of the
    CSV is 111 KB.
  2. I created 2 fields (of type string) for each column: one analyzed, to
    run search, and one not analyzed, to run facets.
  3. The index was configured with 5 shards.
  4. After indexing I found that the size of the index was 4.5 MB (this
    includes all 5 shards, the translog, etc.).
  5. That is roughly 40 times the original size, which is a very
    significant increase.
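
The two fields per column described in step 2 can be sketched as a multi_field mapping (a sketch only: the field names are placeholders, and the multi_field syntax here assumes the pre-1.0 mapping format):

```json
{
  "doc": {
    "properties": {
      "column1": {
        "type": "multi_field",
        "fields": {
          "column1":   { "type": "string", "index": "analyzed" },
          "untouched": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}
```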

Then I tried not to include _source and found that the index size reduced
by about 25%, coming down to 3 MB, which is still significant.

Am I missing something? Or are there other ways to reduce the size of the
indices?

Thanks
Rahul


(Shay Banon) #2

Store-level compression is one way to compress the data (available in 0.19.8). Note that CSV by itself is quite different from the JSON format.
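
As a sketch, store-level compression is enabled per index through settings along these lines (setting names as documented for the 0.19.x line; verify against the exact release you run):

```json
{
  "index.store.compress.stored": true,
  "index.store.compress.tv": true
}
```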

On Aug 5, 2012, at 2:12 PM, Rahul Sharma rahul.sharma.coder@gmail.com wrote:

Hi,

I was trying to understand how much size an index would take for a certain size of input data.

Below is the scenario and observation:

  1. For the purpose I picked 10 column csv with 1000 rows. The size of the csv is 111 KB.
  2. I created 2 Field (of type String) for each column. 1 Analyzed to run search and 1 Not Analyzed to run facet.
  3. The index was configured to create 5 segments.
  4. After indexing I found that the size of the index was 4.5MB. (This includes all 5 shards , trans log etc...)
  5. Which means its almost 45 times more than the original size. Which is very significant increase.

Then I tried not to include _source and found that the index size reduced by 25%. Came down to 3 mb. Which is still significant.

Am I missing something? Or is there any other ways to reduce the size of Indices?

Thanks
Rahul


(Rahul Sharma) #3

I am on 0.19.3.

I tried the following, which helped:

  1. Reduced the size of the field keys.
  2. Enabled _source compression.

This halved the index size, but in my context it is still very big.

  1. I tried removing _source. This reduced the size of the index by a
    further 50%. But if I understand correctly, search will stop working,
    as it relies on _source.
  2. Since I store each value as a field for faceting, I am wondering
    whether there is a way to reconstruct the document from the fields I
    index as "not analyzed".
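
A sketch of both ideas, assuming the pre-1.0 mapping syntax: _source compression goes in the type mapping, and individual values can be fetched back from stored fields instead of _source (the field must be marked as stored for this to work, and the names here are placeholders):

```json
{
  "doc": {
    "_source": { "compress": true },
    "properties": {
      "column1": {
        "type": "string",
        "index": "not_analyzed",
        "store": "yes"
      }
    }
  }
}
```

and then request the stored field in the search instead of the source:

```json
{
  "query": { "match_all": {} },
  "fields": [ "column1" ]
}
```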

Thanks
Rahul

On Wed, Aug 8, 2012 at 3:30 AM, Shay Banon kimchy@gmail.com wrote:

Store level compression is one way to compress the data (available in
0.19.8). Note that CSV is quite different by itself than json format.

On Aug 5, 2012, at 2:12 PM, Rahul Sharma rahul.sharma.coder@gmail.com
wrote:

Hi,

I was trying to understand how much size an index would take for a certain
size of input data.

Below is the scenario and observation:

  1. For the purpose I picked 10 column csv with 1000 rows. The size of the
    csv is 111 KB.
  2. I created 2 Field (of type String) for each column. 1 Analyzed to run
    search and 1 Not Analyzed to run facet.
  3. The index was configured to create 5 segments.
  4. After indexing I found that the size of the index was 4.5MB. (This
    includes all 5 shards , trans log etc...)
  5. Which means its almost 45 times more than the original size. Which
    is very significant increase.

Then I tried not to include _source and found that the index size reduced
by 25%. Came down to 3 mb. Which is still significant.

Am I missing something? Or is there any other ways to reduce the size of
Indices?

Thanks
Rahul


(Rahul Sharma) #4

When you say store-level compression, do you mean compression works only
for fields marked field("store", "yes"), or for other fields as well?
Is it in addition to _source compression?

Does it impact faceting performance?


