300K file increases index disk space by 400K

Hi,

I have a question. We do a lot of work for accounting offices. When testing, we discovered that a 300K file increased the index store by 400K. The file itself is not stored.
The file contains mainly numbers. I have attached some of the data to the question.
Please explain why a file with a lot of numbers increases the index that much.

Regards
Eric

What's the mapping of the type you're storing the data in?

It's common for data indexed in ES to require more storage than the raw data.

{
  "hyarchisdocument": {
    "_source": {
      "excludes": [
        "hyarchis_attachment"
      ]
    },
    "_id": {
      "path": "hyarchis_id"
    },
    "properties": {
      "index1": {
        "type": "integer",
        "store": true
      },
      "index2": {
        "type": "integer",
        "store": true
      },
      "index3": {
        "type": "integer",
        "store": true
      },
      "index4": {
        "type": "integer",
        "store": true
      },
      "index5": {
        "type": "string",
        "store": true
      },
      "index6": {
        "type": "string",
        "store": true
      },
      "index7": {
        "type": "integer",
        "store": true
      },
      "index8": {
        "type": "integer",
        "store": true
      },
      "index9": {
        "type": "integer",
        "store": true
      },
      "index10": {
        "type": "string",
        "store": true
      },
      "index11": {
        "type": "integer",
        "store": true
      },
      "index12": {
        "type": "integer",
        "store": true
      },
      "index13": {
        "type": "integer",
        "store": true
      },
      "index14": {
        "type": "string",
        "store": true
      },
      "index15": {
        "type": "string",
        "store": true
      },
      "index16": {
        "type": "integer",
        "store": true
      },
      "index17": {
        "type": "date",
        "store": true
      },
      "index18": {
        "type": "string",
        "store": true
      },
      "index19": {
        "type": "date",
        "store": true
      },
      "index20": {
        "type": "integer",
        "store": true
      },
      "index1020": {
        "type": "boolean",
        "store": true
      },
      "index1021": {
        "type": "integer",
        "store": true
      },
      "index1022": {
        "type": "string",
        "store": true
      },
      "hyarchis_documentid": {
        "type": "integer",
        "store": true
      },
      "hyarchis_name": {
        "type": "string",
        "store": true
      },
      "hyarchis_extensions": {
        "type": "string"
      },
      "hyarchis_attachment": {
        "type": "attachment",
        "store": false
      }
    }
  }
}

This is the info stored:
{
  "_shard": 1,
  "_node": "3CyrcJqmQjm24IZoBtBpuw",
  "_index": "index1",
  "_type": "hyarchisdocument",
  "_id": "4680",
  "_score": 0.002076112,
  "fields": {
    "hyarchis_name": [
      "CijferSheet met comma als text"
    ],
    "hyarchis_documentid": [
      4680
    ],
    "hyarchis_extensions": [
      "txt"
    ],
    "index11": [
      78
    ],
    "index1": [
      1
    ],
    "hyarchis_id": [
      4680
    ],
    "index2": [
      15
    ],
    "index5": [
      "1"
    ],
    "index7": [
      25
    ],
    "index6": [
      "Ariens"
    ],
    "index18": [
      "BAI_1_Static"
    ]
  },
  "sort": [
    0.002076112
  ],
  "_explanation": {
    "value": 0.002076112,
    "description": "weight(_all:4680 in 0) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.002076112,
        "description": "fieldWeight in 0, product of:",
        "details": [
          {
            "value": 1.7320508,
            "description": "tf(freq=3.0), with freq of:",
            "details": [
              {
                "value": 3,
                "description": "termFreq=3.0"
              }
            ]
          },
          {
            "value": 0.30685282,
            "description": "idf(docFreq=1, maxDocs=1)"
          },
          {
            "value": 0.00390625,
            "description": "fieldNorm(doc=0)"
          }
        ]
      }
    ]
  }
}
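
For anyone reading the _explanation block above, here is a small Python sketch that recomputes the score from its parts. It assumes the classic Lucene TF-IDF formulas (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1))); the variable names are mine, and the input numbers are copied from the output above.

```python
import math

# Values copied from the _explanation block in this post.
term_freq = 3.0          # termFreq=3.0
doc_freq = 1             # docFreq=1
max_docs = 1             # maxDocs=1
field_norm = 0.00390625  # fieldNorm(doc=0), which equals 1/256

# Assumed classic TF-IDF formulas (not stated in the output itself).
tf = math.sqrt(term_freq)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))

# "fieldWeight in 0, product of:" tf, idf and fieldNorm.
score = tf * idf * field_norm

print(tf, idf, score)
```

The product agrees with the reported _score of 0.002076112 to the precision shown. As a side note, under the same assumption the field norm is roughly 1/sqrt(number of terms in the field), stored with low precision, so a value of 1/256 would suggest that _all holds tens of thousands of terms for this one document. That is only a rough reading, not an exact count.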

You are keeping _source and, on top of that, both indexing and storing each field. This makes the index larger than the data you fed in. If you don't need to support updates, you can disable _source and store only the fields you want to retrieve later. Your index size should then come down.

We do use updates, so we cannot disable _source.

I see. But now you know why your index disk usage is more than what you fed in!

Can you explain that further? The small amount of data we store in the _source is not enough to explain the huge increase. We do not store the indexed file.

If you look at the mapping, each field (except the attachment) is indexed, stored, and also saved in _source. ES/Lucene creates inverted indices for the fields that are indexed, and the stored fields and _source take up additional disk space on top of that (though they are compressed).

This is a good discussion on when to use store v/s _source: http://stackoverflow.com/questions/28678296/elasticsearch-store-field-vs-source . One more good thread: http://stackoverflow.com/questions/17103047/why-do-i-need-storeyes-in-elasticsearch

Since you want partial updates, _source is needed. The question is why you want the individual fields to be stored as well.
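
To make that concrete, here is a rough Python sketch that takes a mapping like yours and removes "store": true from every property while leaving _source alone. The helper name drop_store_flags is hypothetical (it is not part of any ES client), and the mapping below is a shortened copy of the one posted in this thread.

```python
import copy

def drop_store_flags(mapping):
    """Return a copy of the mapping with "store": true removed from each property.

    Hypothetical helper for illustration only; it just edits a plain dict.
    """
    result = copy.deepcopy(mapping)
    for type_body in result.values():
        for field in type_body.get("properties", {}).values():
            if field.get("store") is True:
                del field["store"]
    return result

# Shortened copy of the mapping from this thread.
mapping = {
    "hyarchisdocument": {
        "_source": {"excludes": ["hyarchis_attachment"]},
        "properties": {
            "index1": {"type": "integer", "store": True},
            "hyarchis_name": {"type": "string", "store": True},
            "hyarchis_attachment": {"type": "attachment", "store": False},
        },
    }
}

slim = drop_store_flags(mapping)
print(slim)
```

The fields would then be fetched from _source instead of from stored fields. I have not tested this against your index, and I would expect that an existing index has to be recreated and reindexed for the change to take effect.
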
Regards,
Imran
Please note: I'm also learning ES, and my participation in discussions helps me learn more!

I see your point. I could use only store. This will save me some disk space.
But this still leaves me with the huge disk usage for a file with a lot of numbers.

  • A 300K file with mostly numbers uses 400K on disk.
  • A 300K Word document increases disk usage by only 40K.
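
One way to investigate the difference, sketched in Python: count how many distinct terms each file produces compared with its total number of tokens. This rests on an assumption that nobody in this thread has confirmed, namely that the extra disk usage comes from the number of distinct terms the inverted index has to hold. The helper term_stats is hypothetical, its regex tokenizer is only a crude stand-in for the real analyzer, and the two sample strings are made up.

```python
import re

def term_stats(text):
    """Return (total tokens, distinct tokens) for a piece of text.

    Hypothetical helper; splitting on \\w+ only approximates what an
    ES analyzer would do.
    """
    tokens = re.findall(r"\w+", text.lower())
    return len(tokens), len(set(tokens))

# Made-up samples: a run of numbers versus ordinary prose.
numbers_sample = "1001 1002 1003 1004 1005 1006"
words_sample = "the invoice for the client and the invoice for the bank"

print(term_stats(numbers_sample))
print(term_stats(words_sample))
```

Running term_stats over the extracted text of the numbers file and of the Word document would show whether the numbers file really has far more distinct terms per token. If it does not, the cause lies somewhere else.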