Index size


(Slava G ) #1

Hi,
I have table data (CSV / Excel); the total amount of data is about 35 MB.
The index mapping disables the _source field and enables a few stored
fields (about 300-400 bytes in total across those fields), but the table
data itself is not stored (the default setting).
The index size on disk is also about 35 MB. Is that normal?
Or is there something that could be done to reduce the index size?

Thank You and Best Regards.


(Karussell) #2

Why do you want to optimize it? What kind of problem(s) do you have
because of that?

That said :) you could try to optimize the index and see if this
would reduce the index size.

Peter.


(Slava G ) #3

The reason I want to optimize it is the price of storage.
When you said to try optimizing the index, did you mean the source data, or
is there a way to optimize the index itself?

Best Regards.


(Shay Banon) #4

You can use the optimize API with the max_num_segments parameter to reduce
the number of internal index segments (by default, it will hover around 10).
Can you gist your mapping? I did not quite follow your explanation.
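
For reference, a minimal sketch of what such a one-time optimize call could look like over REST. It only builds the request and does not send it. The host and the index name "csv_index" are made up for illustration; the endpoint is the _optimize API of Elasticsearch releases from this period, where the parameter is spelled max_num_segments (newer releases replaced it with _forcemerge).

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_optimize_request(base_url: str, index: str, max_num_segments: int = 1) -> Request:
    """Build (but do not send) a POST request for the legacy _optimize endpoint."""
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_optimize?{query}"
    # An empty body is enough; all options travel in the query string.
    return Request(url, data=b"", method="POST")


# Hypothetical host and index name, for illustration only.
req = build_optimize_request("http://localhost:9200", "csv_index")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually run it; since merging down to one segment rewrites the index, it is best done once after the bulk load has finished.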


(Slava G ) #5

The mapping is:
"csv" : {
  "_source" : {
    "enabled" : false
  },
  "properties" : {
    "id" : {
      "index" : "not_analyzed",
      "store" : "yes",
      "type" : "string"
    },
    "source" : {
      "index" : "not_analyzed",
      "store" : "yes",
      "type" : "string"
    },
    "file" : {
      "index" : "no",
      "store" : "yes",
      "type" : "string"
    },
    "taskid" : {
      "index" : "not_analyzed",
      "store" : "yes",
      "type" : "string"
    },
    "name" : {
      "index" : "not_analyzed",
      "store" : "yes",
      "type" : "string"
    },
    "data" : {
      "type" : "string"
    },
    "date" : {
      "store" : "yes",
      "format" : "dateOptionalTime",
      "type" : "date"
    },
    "account" : {
      "index" : "not_analyzed",
      "store" : "yes",
      "type" : "string"
    }
  }
}

The CSV has about 55k rows. Each row has 5-10 fields, and each field can
hold a list of values, so all values within the same row are indexed as one
list (array) mapped to the "data" field (see the mapping).
The total amount of source data, including the additional fields, is about
25 MB, and the index on disk is about the same size (more or less).
So the question is: is this OK?
Another question: could changing max_num_segments affect indexing speed or
query speed? Where is this setting in the Java API?
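
A rough back-of-the-envelope check on those numbers. This assumes the 300-400 bytes of stored fields mentioned in the first post is per row, which the thread does not state explicitly, and it ignores the inverted index for "data" and any per-segment overhead.

```python
def stored_fields_bytes(rows: int, bytes_per_row: int) -> int:
    """Raw size of the stored field values, before any index overhead."""
    return rows * bytes_per_row


ROWS = 55_000  # "about 55k rows" from the post above

# Assumption: 300-400 bytes of stored fields per row.
low = stored_fields_bytes(ROWS, 300)
high = stored_fields_bytes(ROWS, 400)

print(f"{low / 1e6:.1f} MB to {high / 1e6:.1f} MB")  # 16.5 MB to 22.0 MB
```

If that assumption holds, the stored fields alone could account for a large share of an index in the 25-35 MB range, before counting the terms indexed for "data".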

Thank You and Best Regards.


(Shay Banon) #6

Your mapping looks fine. It seems like you store each field individually.
You can do it another way, though: instead of storing each field, you can
store _source but exclude the data field; then you will be able to
compress it as well:
http://www.elasticsearch.org/guide/reference/mapping/source-field.html
(check the exclude part).
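
A sketch of how that alternative mapping might look, written as a Python dict so it can be serialized and sent with the put-mapping call. The "compress" and "excludes" options are the ones described on the _source reference page for Elasticsearch releases of this period; treat the exact spelling as an assumption and check it against your version. The per-field "store" settings are dropped because the values would come back from _source instead.

```python
import json


def build_mapping() -> dict:
    """Mapping that keeps a compressed _source without the large "data" field."""
    not_analyzed = {"type": "string", "index": "not_analyzed"}
    return {
        "csv": {
            # Assumed option names: compress the stored JSON, leave "data" out of it.
            "_source": {"enabled": True, "compress": True, "excludes": ["data"]},
            "properties": {
                "id": dict(not_analyzed),
                "source": dict(not_analyzed),
                "file": {"type": "string", "index": "no"},
                "taskid": dict(not_analyzed),
                "name": dict(not_analyzed),
                # Still indexed for search, but neither stored nor kept in _source.
                "data": {"type": "string"},
                "date": {"type": "date", "format": "dateOptionalTime"},
                "account": dict(not_analyzed),
            },
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

Whether this ends up smaller than individually stored fields depends on how well the small fields compress together, so it is worth measuring on a copy of the data.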

Reducing the number of segments in the actual merge configuration will mean
slower indexing. I suggested using the optimize call to reduce it once you
have loaded the data (it's a one-time effort).


(Slava G ) #7

Hi Shay,
Storing _source is an interesting approach that I had not thought about,
but will it really help? Today the "data" field (the list of all values) is
only indexed, not stored, while _source would be stored but not indexed.
So if I add _source to storage, won't that only increase the index size?
Or maybe I didn't get you?

Thank You.

P.S. I ran the optimize call with max_num_segments = 1; the size remains the same.

