70MB csv file becomes 32GB index

nielsbuyens · September 8, 2017, 2:01pm

So I have a CSV file that contains about 720,000 lines like this one:
75.00,-179.75,20170101,-16.69,-9.84,-24.39,5.98,0.38,0.38,0.24,0.0,0.0,5.74,-16.69,12.83,1.34

I'm loading it into an Elasticsearch index with the following mapping, using a bulk insert:
{
  "meteo": {
    "mappings": {
      "meteo-mapping": {
        "properties": {
          "DAY": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "GRID_NO": {
            "type": "long"
          },
          "MAXIMUM_TEMPERATURE": {
            "type": "float"
          },
          "MEAN_TEMPERATURE": {
            "type": "float"
          },
          "MINIMUM_TEMPERATURE": {
            "type": "float"
          },
          "RAINFALL": {
            "type": "float"
          },
          "location": {
            "type": "geo_shape"
          }
        }
      }
    }
  }
}

This gives me an index with the same number of documents (about 720,000). A single document looks like this:
{
  "_index": "meteo",
  "_type": "meteo-mapping",
  "_id": "AV5gnzdNRvAaLryEPhOY",
  "_score": 1,
  "_source": {
    "GRID_NO": 15,
    "DAY": "2017-03-01T00:00:00.000Z",
    "RAINFALL": 0.18,
    "MAXIMUM_TEMPERATURE": -26.3,
    "MINIMUM_TEMPERATURE": -26.28,
    "MEAN_TEMPERATURE": -27.16,
    "location": {
      "type": "Polygon",
      "coordinates": [
        [
          [-176.25, 74.75],
          [-176, 74.75],
          [-176, 75],
          [-176.25, 75],
          [-176.25, 74.75]
        ]
      ]
    }
  }
}

The only problem is that the index is now about 32GB in size (including the replica, so 16GB for the primaries alone).
Our cluster consists of 3 Elasticsearch nodes and 1 Kibana node. The index has 3 shards and 1 replica.

Any ideas where the extra size is coming from?
I know it can increase quite a bit because it's adding structure around it, but this seems a bit extreme.
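
For scale, here is a back-of-the-envelope calculation using only the figures quoted above. It assumes decimal units (1 MB = 10**6 bytes), which the post does not state. It works out to roughly 100 bytes per CSV line against about 22 KB per indexed document on the primaries, a factor of around 230.

```python
# Figures quoted in the post. Decimal units are an assumption.
CSV_BYTES = 70 * 10**6            # "70MB csv file"
PRIMARY_INDEX_BYTES = 16 * 10**9  # "16GB if looking at primaries only"
DOC_COUNT = 720_000               # "about 720,000 lines"

bytes_per_csv_line = CSV_BYTES / DOC_COUNT
bytes_per_indexed_doc = PRIMARY_INDEX_BYTES / DOC_COUNT
blowup = PRIMARY_INDEX_BYTES / CSV_BYTES

print(f"{bytes_per_csv_line:.0f} bytes per CSV line")
print(f"{bytes_per_indexed_doc:.0f} bytes per indexed document (primaries)")
print(f"index is about {blowup:.0f}x the size of the source file")
```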

Mark_Harwood · September 8, 2017, 4:55pm

My guess is the indexing of the geo field is costing you the disk space here.

See the topic "Huge difference in index size when bulk indexing geoshapes across ES versions" for some detail on indexing options and their costs.
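
As a rough illustration of what such indexing options look like, here is a sketch of the same mapping with explicit settings on the geo_shape field, written as a Python dict. It assumes an Elasticsearch version from that era in which geo_shape fields are indexed with a prefix tree and accept the `tree`, `precision` and `distance_error_pct` parameters. The values `quadtree`, `5km` and `0.025` are placeholders chosen for illustration, not recommendations from this thread. Whether they actually shrink the index depends on the version and its defaults, so measure on a sample before reindexing everything.

```python
def meteo_mapping(tree="quadtree", precision="5km", distance_error_pct=0.025):
    """Build an index-creation body with explicit geo_shape indexing options.

    The option values are illustrative assumptions; coarser settings trade
    spatial accuracy for fewer indexed terms per shape.
    """
    return {
        "mappings": {
            "meteo-mapping": {
                "properties": {
                    "DAY": {"type": "date", "format": "dateOptionalTime"},
                    "GRID_NO": {"type": "long"},
                    "MAXIMUM_TEMPERATURE": {"type": "float"},
                    "MEAN_TEMPERATURE": {"type": "float"},
                    "MINIMUM_TEMPERATURE": {"type": "float"},
                    "RAINFALL": {"type": "float"},
                    "location": {
                        "type": "geo_shape",
                        "tree": tree,
                        "precision": precision,
                        "distance_error_pct": distance_error_pct,
                    },
                }
            }
        }
    }


body = meteo_mapping()
```

The grid cells in the sample document are 0.25 degrees on a side, so a precision far finer than the cell size may be buying accuracy the data does not need.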

Spatial indexing settings aside, if you have many weather records for the same place, it is not very efficient for every daily observation to repeat the exact coordinates of that area. Denormalization may be costing you dearly here.
It may work better to have your application first run the geo-query against a separate places index that holds the indexed shapes, retrieve the list of matching place names/ids, and then use those to query the weather index, where only place names/ids (not coordinates) are indexed.
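
A minimal sketch of that two-step lookup, building only the request bodies so it runs without a cluster. It assumes that `GRID_NO` identifies a grid cell and can serve as the place id, which the thread does not confirm. The function names and the idea of a separate places index are hypothetical. Step 1 is sent to the places index, step 2 to the weather index.

```python
def places_query(lon, lat):
    """Step 1: find the grid cells whose shape intersects a point."""
    return {
        "_source": ["GRID_NO"],
        "query": {
            "geo_shape": {
                "location": {
                    "shape": {"type": "point", "coordinates": [lon, lat]},
                    "relation": "intersects",
                }
            }
        },
    }


def grid_nos_from_hits(response):
    """Pull the GRID_NO values out of a search response body."""
    return [hit["_source"]["GRID_NO"] for hit in response["hits"]["hits"]]


def weather_query(grid_nos, day):
    """Step 2: fetch observations by id only; no shape is stored or queried."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"GRID_NO": grid_nos}},
                    {"term": {"DAY": day}},
                ]
            }
        }
    }
```

With this split, each polygon is indexed once per grid cell rather than once per daily observation, and the weather documents carry only the numeric id.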
