So I have a CSV file that contains about 720,000 lines like this:
75.00,-179.75,20170101,-16.69,-9.84,-24.39,5.98,0.38,0.38,0.24,0.0,0.0,5.74,-16.69,12.83,1.34
I'm loading it into an Elasticsearch index via bulk inserts (a sketch of the bulk request format follows the mapping below), using the following mapping:
{
  "meteo": {
    "mappings": {
      "meteo-mapping": {
        "properties": {
          "DAY": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "GRID_NO": {
            "type": "long"
          },
          "MAXIMUM_TEMPERATURE": {
            "type": "float"
          },
          "MEAN_TEMPERATURE": {
            "type": "float"
          },
          "MINIMUM_TEMPERATURE": {
            "type": "float"
          },
          "RAINFALL": {
            "type": "float"
          },
          "location": {
            "type": "geo_shape"
          }
        }
      }
    }
  }
}
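For context, each bulk request pairs an action line with a single-line source document, roughly like this (a minimal sketch; the endpoint form and the values are illustrative rather than copied from our loader, only the field names come from the mapping above):

POST /meteo/meteo-mapping/_bulk
{ "index": {} }
{ "GRID_NO": 15, "DAY": "2017-01-01", "RAINFALL": 0.38, "MEAN_TEMPERATURE": -16.69, "MAXIMUM_TEMPERATURE": -9.84, "MINIMUM_TEMPERATURE": -24.39, "location": { "type": "Polygon", "coordinates": [ [ [ -179.75, 74.75 ], [ -179.5, 74.75 ], [ -179.5, 75 ], [ -179.75, 75 ], [ -179.75, 74.75 ] ] ] } }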
This gives me an index with the same number of documents (about 720,000). A single document looks like this:
{
  "_index": "meteo",
  "_type": "meteo-mapping",
  "_id": "AV5gnzdNRvAaLryEPhOY",
  "_score": 1,
  "_source": {
    "GRID_NO": 15,
    "DAY": "2017-03-01T00:00:00.000Z",
    "RAINFALL": 0.18,
    "MAXIMUM_TEMPERATURE": -26.3,
    "MINIMUM_TEMPERATURE": -26.28,
    "MEAN_TEMPERATURE": -27.16,
    "location": {
      "type": "Polygon",
      "coordinates": [
        [
          [ -176.25, 74.75 ],
          [ -176, 74.75 ],
          [ -176, 75 ],
          [ -176.25, 75 ],
          [ -176.25, 74.75 ]
        ]
      ]
    }
  }
}
The only problem is that the index is now about 32 GB in size (including replicas; 16 GB if looking at primaries only).
Our cluster consists of 3 Elasticsearch nodes and 1 Kibana node. The index has 3 primary shards and 1 replica.
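For reference, size figures like these can be read off the cat indices API, where store.size includes replicas and pri.store.size counts primaries only:

GET /_cat/indices/meteo?v&h=index,pri,rep,docs.count,store.size,pri.store.size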
Any ideas where the extra size is coming from?
I know it can grow quite a bit because indexing adds structure around the raw data, but this seems extreme.
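As a back-of-the-envelope sense of how extreme: the sample row above is roughly 95 bytes, so the raw CSV is on the order of

720,000 rows × ~95 B ≈ 68 MB

which makes 16 GB of primary storage an expansion of more than 200x over the source data.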