70MB CSV file becomes 32GB index

I have a CSV file that contains about 720,000 lines like this one:
75.00,-179.75,20170101,-16.69,-9.84,-24.39,5.98,0.38,0.38,0.24,0.0,0.0,5.74,-16.69,12.83,1.34

I'm loading it into an Elasticsearch index with a bulk insert, using the following mapping:
{
  "meteo": {
    "mappings": {
      "meteo-mapping": {
        "properties": {
          "DAY": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "GRID_NO": {
            "type": "long"
          },
          "MAXIMUM_TEMPERATURE": {
            "type": "float"
          },
          "MEAN_TEMPERATURE": {
            "type": "float"
          },
          "MINIMUM_TEMPERATURE": {
            "type": "float"
          },
          "RAINFALL": {
            "type": "float"
          },
          "location": {
            "type": "geo_shape"
          }
        }
      }
    }
  }
}

This gives me an index with the same number of documents (about 720,000). A single document looks like this:
{
  "_index": "meteo",
  "_type": "meteo-mapping",
  "_id": "AV5gnzdNRvAaLryEPhOY",
  "_score": 1,
  "_source": {
    "GRID_NO": 15,
    "DAY": "2017-03-01T00:00:00.000Z",
    "RAINFALL": 0.18,
    "MAXIMUM_TEMPERATURE": -26.3,
    "MINIMUM_TEMPERATURE": -26.28,
    "MEAN_TEMPERATURE": -27.16,
    "location": {
      "type": "Polygon",
      "coordinates": [
        [
          [-176.25, 74.75],
          [-176, 74.75],
          [-176, 75],
          [-176.25, 75],
          [-176.25, 74.75]
        ]
      ]
    }
  }
}

The only problem is that the index is now about 32GB in size (including the replica, so 16GB for the primaries alone).
Our cluster consists of 3 Elasticsearch nodes and 1 Kibana node. The index has 3 shards and 1 replica.

Any ideas where the extra size is coming from?
I know an index can be quite a bit larger than the raw data because of the structure added around it, but this seems a bit extreme.
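
To put rough numbers on that, here is a back-of-the-envelope calculation using only the figures quoted above. Treating MB and GB as binary units is an assumption, and the variable names are just for illustration. It works out to roughly 100 bytes per CSV line against roughly 23 KB per indexed document on the primaries, a factor of about 230.

```python
# Rough per-document size estimate from the figures in the post.
# Assumption: MB/GB are binary units (1 MB = 1024**2 bytes).
csv_bytes = 70 * 1024**2            # 70MB source file
primary_index_bytes = 16 * 1024**3  # 16GB on primaries (32GB with the replica)
doc_count = 720_000                 # one document per CSV line

csv_bytes_per_line = csv_bytes / doc_count
index_bytes_per_doc = primary_index_bytes / doc_count
blowup = primary_index_bytes / csv_bytes

print(f"CSV bytes per line:  {csv_bytes_per_line:.0f}")
print(f"Index bytes per doc: {index_bytes_per_doc:.0f}")
print(f"Size increase:       {blowup:.0f}x")
```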

My guess is the indexing of the geo field is costing you the disk space here.

See the topic "Huge difference in index size when bulk indexing geoshapes across ES versions" for some detail on indexing options and their costs.

Putting spatial indexing settings aside, if you have very many weather records for the same place, it is not efficient for every observation of a day's rainfall to repeat the exact coordinates of that area. Denormalization may be costing you dearly here.
It may prove better to have your application first run a geo-query against a places index that holds the shape information, retrieve the matching place names/IDs, and then use those to query the weather index, where place names or IDs are indexed rather than coordinates.
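
A minimal sketch of that split, assuming GRID_NO can serve as the place ID. The helper names (split_records, places_query, weather_query) are hypothetical, and the second sample record is made up for illustration. It only builds the documents and query bodies as plain dicts; sending them to the cluster is left out.

```python
def split_records(records):
    """Split denormalized records into one place doc per GRID_NO
    plus slim weather docs that no longer carry the shape."""
    places = {}
    weather = []
    for rec in records:
        grid_no = rec["GRID_NO"]
        # Store the shape only once per grid cell.
        places.setdefault(grid_no, {"GRID_NO": grid_no, "location": rec["location"]})
        # Keep every field except the shape on the weather doc.
        weather.append({k: v for k, v in rec.items() if k != "location"})
    return list(places.values()), weather


def places_query(shape):
    """Step 1: geo_shape query body for the (hypothetical) places index."""
    return {"query": {"geo_shape": {"location": {"shape": shape, "relation": "intersects"}}}}


def weather_query(grid_nos):
    """Step 2: terms query body for the weather index, by place ID."""
    return {"query": {"terms": {"GRID_NO": sorted(grid_nos)}}}


cell = {
    "type": "Polygon",
    "coordinates": [[[-176.25, 74.75], [-176, 74.75], [-176, 75],
                     [-176.25, 75], [-176.25, 74.75]]],
}
records = [
    {"GRID_NO": 15, "DAY": "2017-03-01", "RAINFALL": 0.18, "location": cell},
    {"GRID_NO": 15, "DAY": "2017-03-02", "RAINFALL": 0.0, "location": cell},  # made-up row
]

places, weather = split_records(records)
print(len(places), "place doc(s),", len(weather), "weather doc(s)")
print(weather_query({p["GRID_NO"] for p in places}))
```

With this layout the geo_shape field is indexed once per grid cell instead of once per daily observation, so the cost of the spatial index no longer grows with the number of days.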
