300K file increases index disk space by 400K


(Eric Ariens) #1

Hi,

I have a question. We do a lot of work for accounting offices. When testing, we discovered that a 300K file increased the index store by 400K. The file itself is not stored.
The file contains mainly numbers. I have attached some of the data to the question.
Please explain why a file with a lot of numbers increases the index that much.

Regards
Eric


(Magnus Bäck) #2

What's the mapping of the type you're storing the data in?

It's common for data indexed in ES to require more storage than the raw data.


(Eric Ariens) #3

{
  "hyarchisdocument": {
    "_source": {
      "excludes": ["hyarchis_attachment"]
    },
    "_id": {
      "path": "hyarchis_id"
    },
    "properties": {
      "index1": { "type": "integer", "store": true },
      "index2": { "type": "integer", "store": true },
      "index3": { "type": "integer", "store": true },
      "index4": { "type": "integer", "store": true },
      "index5": { "type": "string", "store": true },
      "index6": { "type": "string", "store": true },
      "index7": { "type": "integer", "store": true },
      "index8": { "type": "integer", "store": true },
      "index9": { "type": "integer", "store": true },
      "index10": { "type": "string", "store": true },
      "index11": { "type": "integer", "store": true },
      "index12": { "type": "integer", "store": true },
      "index13": { "type": "integer", "store": true },
      "index14": { "type": "string", "store": true },
      "index15": { "type": "string", "store": true },
      "index16": { "type": "integer", "store": true },
      "index17": { "type": "date", "store": true },
      "index18": { "type": "string", "store": true },
      "index19": { "type": "date", "store": true },
      "index20": { "type": "integer", "store": true },
      "index1020": { "type": "boolean", "store": true },
      "index1021": { "type": "integer", "store": true },
      "index1022": { "type": "string", "store": true },
      "hyarchis_documentid": { "type": "integer", "store": true },
      "hyarchis_name": { "type": "string", "store": true },
      "hyarchis_extensions": { "type": "string" },
      "hyarchis_attachment": { "type": "attachment", "store": false }
    }
  }
}


(Eric Ariens) #4

This is the info stored:
{
  "_shard": 1,
  "_node": "3CyrcJqmQjm24IZoBtBpuw",
  "_index": "index1",
  "_type": "hyarchisdocument",
  "_id": "4680",
  "_score": 0.002076112,
  "fields": {
    "hyarchis_name": ["CijferSheet met comma als text"],
    "hyarchis_documentid": [4680],
    "hyarchis_extensions": ["txt"],
    "index11": [78],
    "index1": [1],
    "hyarchis_id": [4680],
    "index2": [15],
    "index5": ["1"],
    "index7": [25],
    "index6": ["Ariens"],
    "index18": ["BAI_1_Static"]
  },
  "sort": [0.002076112],
  "_explanation": {
    "value": 0.002076112,
    "description": "weight(_all:4680 in 0) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.002076112,
        "description": "fieldWeight in 0, product of:",
        "details": [
          {
            "value": 1.7320508,
            "description": "tf(freq=3.0), with freq of:",
            "details": [
              {
                "value": 3,
                "description": "termFreq=3.0"
              }
            ]
          },
          {
            "value": 0.30685282,
            "description": "idf(docFreq=1, maxDocs=1)"
          },
          {
            "value": 0.00390625,
            "description": "fieldNorm(doc=0)"
          }
        ]
      }
    ]
  }
}
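
The numbers in the _explanation above can be checked by hand. The sketch below assumes Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))) and a boost of 1.0; neither is stated in the thread, but the tf and idf values in the output match those formulas. The field-length estimate at the end is only approximate, because the norm is stored in a lossy one-byte encoding.

```python
import math

# Values copied from the _explanation output above.
term_freq = 3.0
doc_freq = 1
max_docs = 1
field_norm = 0.00390625

# Assumed classic TF-IDF formulas (not confirmed in the thread).
tf = math.sqrt(term_freq)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))
score = tf * idf * field_norm

# With a boost of 1.0 the norm is roughly 1 / sqrt(number of terms in the field),
# so inverting it gives a rough size for the _all field of this document.
approx_terms_in_all = round(1.0 / field_norm ** 2)

print(f"tf={tf:.7f} idf={idf:.8f} score={score:.9f}")
print(f"approximate number of terms in _all: {approx_terms_in_all}")
```

If these assumptions hold, a norm of 1/256 suggests that the _all field of this one document holds tens of thousands of terms, far more than the handful of metadata fields listed under "fields" would account for.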


(Imran Siddique) #5

You are storing _source as well as indexing the fields and storing them. This makes the index larger than the data you fed in. If you don't need to support updates, you can disable _source and store only the fields you want to retrieve later. Your index size should then come down.


(Eric Ariens) #6

We do use updates, so we cannot disable _source.


(Imran Siddique) #7

I see. But now you know why your index takes more disk space than what you fed in!


(Eric Ariens) #8

Can you explain that further? The small amount of data we store in _source is not enough to explain the large increase. We do not store the indexed file.


(Imran Siddique) #9

If you look at the mapping, each field (except the attachment) is indexed, stored, and also saved in _source. ES/Lucene creates inverted indices for the fields that are indexed, and the stored fields and _source consume additional disk space on top of that (though they are compressed).

This is a good discussion on when to use store vs. _source: http://stackoverflow.com/questions/28678296/elasticsearch-store-field-vs-source . One more good thread: http://stackoverflow.com/questions/17103047/why-do-i-need-storeyes-in-elasticsearch

Since you want partial updates, _source is needed. The question is why you want the individual fields to be stored as well.
Regards,
Imran
Please note: I'm also learning ES and my participation in discussions helps me learn more!
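
As an illustration of the point about storing fields that are already in _source, here is a minimal sketch that removes the "store": true flag from every field that can still be read back from _source. The helper name drop_store_flags and the shortened mapping are hypothetical and used only for the example. How much space this would actually save is not established in this thread, and changing store on an existing field would presumably mean reindexing.

```python
import copy
import json


def drop_store_flags(mapping):
    """Return a copy of a type mapping without "store" on fields kept in _source."""
    result = copy.deepcopy(mapping)
    for type_body in result.values():
        excluded = set(type_body.get("_source", {}).get("excludes", []))
        for name, field in type_body.get("properties", {}).items():
            # Fields excluded from _source cannot be read back from it,
            # so their store setting is left untouched.
            if name in excluded:
                continue
            field.pop("store", None)
    return result


# Shortened, hypothetical version of the mapping posted earlier in the thread.
mapping = {
    "hyarchisdocument": {
        "_source": {"excludes": ["hyarchis_attachment"]},
        "properties": {
            "index1": {"type": "integer", "store": True},
            "index5": {"type": "string", "store": True},
            "hyarchis_attachment": {"type": "attachment", "store": False},
        },
    }
}

slim = drop_store_flags(mapping)
print(json.dumps(slim, indent=2))
```

The original mapping dict is left unchanged, so the two versions can be compared side by side before anything is applied to a cluster.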


(Eric Ariens) #10

I see your point. I could use store only, which would save me some disk space.
But this still leaves me with the huge disk usage for a file with a lot of numbers.

  • A 300K file with mostly numbers will use 400K on disk.
  • A 300K Word document will only increase disk usage by 40K.
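
One possible explanation, not confirmed anywhere in this thread, is that a file of figures consists almost entirely of distinct tokens, while prose repeats a limited vocabulary; an inverted index has to keep an entry for every distinct term. The sketch below compares the share of distinct tokens in two made-up texts. The tokenizer is a simple regular expression, not the Elasticsearch analyzer, and both sample texts are invented for the illustration.

```python
import random
import re


def unique_token_ratio(text):
    """Return distinct tokens divided by total tokens, using a simple regex tokenizer."""
    tokens = re.findall(r"[0-9A-Za-z]+(?:[.,][0-9]+)*", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


rng = random.Random(42)

# Hypothetical sheet of figures: amounts with a comma as the decimal separator.
numbers_text = " ".join(
    f"{rng.randint(0, 999999)},{rng.randint(0, 99):02d}" for _ in range(5000)
)

# Hypothetical prose: a small vocabulary repeated many times.
vocabulary = ["the", "invoice", "was", "sent", "to", "client", "and", "paid", "on", "time"]
prose_text = " ".join(rng.choice(vocabulary) for _ in range(5000))

numbers_ratio = unique_token_ratio(numbers_text)
prose_ratio = unique_token_ratio(prose_text)
print(f"numbers: {numbers_ratio:.3f}  prose: {prose_ratio:.3f}")
```

Running the same function over the real 300K numbers file and over the text extracted from the Word document would show whether the difference in distinct terms is in the same range as the 400K versus 40K difference on disk.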
