Hi,
I have table data (CSV / Excel); the total amount of data is about 35 MB.
The index mapping disables the _source field and enables a few fields to be
stored (the total size of those fields is about 300-400 bytes), but the
table data itself is not stored (the default setting).
The index size on disk is also about 35 MB. Is that normal?
Or is there something that could be done to reduce the index size?
The reason I want to optimize it is the price of storage.
When you said to try to optimize the index, did you mean the source data, or
is there a way to optimize the index itself?
You can use the optimize API with the max_num_segments parameter to reduce
the number of internal index segments (by default, it will hover around 10).
Can you gist your mapping? I did not quite follow your explanation.
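For anyone following along, here is a minimal sketch of how the one-off
optimize request could be addressed over the REST API. It only builds the
URL and sends nothing. The host and the index name "tables" are made up for
illustration, and the _optimize endpoint is the one from Elasticsearch
versions of this thread's era (later releases renamed it to _forcemerge).

```python
from urllib.parse import quote, urlencode


def optimize_url(base_url: str, index: str, max_num_segments: int = 1) -> str:
    """Build the URL for a one-off optimize (segment merge) request.

    The request itself would be sent as an HTTP POST with an empty body.
    """
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    query = urlencode({"max_num_segments": max_num_segments})
    # Quote the index name so unusual characters cannot break the path.
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_optimize?{query}"


if __name__ == "__main__":
    # Hypothetical host and index name, for illustration only.
    print(optimize_url("http://localhost:9200", "tables"))
```

Merging down to one segment mostly reclaims space held by deleted documents
and per-segment overhead, so on a freshly loaded index the saving may be
small.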
The CSV has about 55k rows. Each row has 5-10 fields, and each field can
hold a list of values, so all values within the same row are indexed as one
list (array) and mapped to the "data" field (you can see it in the mapping).
So the total amount of source data, together with the additional fields,
is about 25 MB, and the index on disk is about the same size (more or
less).
So the question is: is this OK?
And another question: could changing max_segments affect indexing speed
or query speed? Where is this setting in the Java API?
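A rough back-of-the-envelope check of these numbers, as a sketch only. It
assumes the 300-400 bytes of stored fields mentioned earlier is a per-row
figure (the thread does not say so explicitly) and ignores compression and
the inverted index itself.

```python
def stored_bytes(rows: int, bytes_per_row: int) -> int:
    """Estimate the raw size of stored fields: one fixed-size record per row."""
    return rows * bytes_per_row


ROWS = 55_000  # "about 55k rows" from the post above

low = stored_bytes(ROWS, 300)   # assumed lower bound per row
high = stored_bytes(ROWS, 400)  # assumed upper bound per row

if __name__ == "__main__":
    print(f"{low / 1e6:.1f} MB to {high / 1e6:.1f} MB")
    # prints "16.5 MB to 22.0 MB"
```

Under that assumption, the stored fields alone would account for a large
share of a roughly 25 MB index, before counting the indexed terms of the
"data" field.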
Your mapping looks fine. It seems like you store each field individually.
You can do it another way, though: instead of storing each field, you can
store _source but exclude the data field; then you will be able to
compress it as well (see the _source field section of the Elasticsearch
mapping documentation, and check the excludes part).
Reducing the number of segments in the actual merge configuration will mean
slower indexing. I suggested using the optimize call to reduce it once you
have loaded the data (it's a one-time effort).
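A minimal sketch of what such a mapping could look like, built as a plain
Python dict and serialized to JSON. The type name "row" is hypothetical, the
real mapping lives in the gist and is not reproduced here, and the string
values ("analyzed", "no") and the _source "compress" flag follow the mapping
syntax of older Elasticsearch releases, so check them against your version.

```python
import json


def source_mapping(type_name: str, excluded_fields: list) -> dict:
    """Build a mapping that keeps _source but leaves some fields out of it."""
    return {
        type_name: {
            "_source": {
                "enabled": True,
                "compress": True,                  # assumption: older releases only
                "excludes": list(excluded_fields), # dropped from the stored _source
            },
            "properties": {
                # Excluded fields stay searchable: indexed but not stored.
                name: {"type": "string", "index": "analyzed", "store": "no"}
                for name in excluded_fields
            },
        }
    }


if __name__ == "__main__":
    # "row" is a made-up type name; "data" is the field discussed above.
    print(json.dumps(source_mapping("row", ["data"]), indent=2))
```

The idea is that the small fields are kept once inside a compressed _source
instead of as separate stored fields, while the large "data" list is still
only indexed.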
Hi Shay,
Storing _source is an interesting approach that I had not thought about. I
just wanted to ask: will it really help? Today the "data" field (the list of
all values) is only indexed, not stored. If I add _source to storage,
won't that only increase the index size? Or maybe I didn't understand
you?
Thank you.
P.S. I ran the optimize call with max_segments = 1; the size remains the same.