I used mapper-attachments to index files. The request puts the file into the "file" field, and its content is copied to the other fields. The following mapping definition was used with ES2:
{
  "files": {
    "properties": {
      "startDate": {
        "type": "date", "index": "not_analyzed", "store": false,
        "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
      },
      "mimetype": {
        "type": "integer", "index": "not_analyzed", "store": false
      },
      "file": {
        "type": "attachment",
        "fields": {
          "title": { "store": false },
          "content_type": { "store": false, "index": "no" },
          "content": { "store": false, "term_vector": "with_positions_offsets", "type": "string", "copy_to": ["fileGrams", "fileEn", "fileLang1", "fileLang2"] },
          "date": { "store": false },
          "author": { "store": false },
          "keywords": { "store": false },
          "content_type" : { "store": false },
          "language": { "store": false }
        }
      },
      "fileGrams": {
        "type": "string", "index": "analyzed", "analyzer": "angram"
      },
      "fileEn": {
        "type": "string", "index": "analyzed", "analyzer": "alangen"
      },
      "fileLang1": {
        "type": "string", "index": "analyzed", "analyzer": "alang1"
      },
      "fileLang2": {
        "type": "string", "index": "analyzed", "analyzer": "alang2"
      }
    }
  }
}
With ES5 I switched to ingest-attachment. Because of mapping changes in ES5 (the string type no longer exists, and I now use multi-fields instead of copy_to), the updated mapping looks like this:
{
  "files": {
    "properties": {
      "startDate": {
        "type": "date", "index": true, "store": false,
        "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
      },
      "mimetype": {
        "type": "integer", "index": true, "store": false
      },
      "file": {
        "type": "text", "index": true,
        "fields": {
          "fileGrams": {
            "type": "text", "index": true, "analyzer": "angram"
          },
          "fileEn": {
            "type": "text", "index": true, "analyzer": "alangen"
          },
          "fileLang1": {
            "type": "text", "index": true, "analyzer": "alang1"
          },
          "fileLang2": {
            "type": "text", "index": true, "analyzer": "alang2"
          }
        }
      }
    }
  }
}
I now face the problem that, for the same fixed set of data (around 120,000 documents of various types: email, PDF, XML, etc.), the index is much larger with ES5:
ES2: 350MB
ES5: 1800MB
When I remove the sub-fields of the "file" field (no multi-fields):
ES5: ~600MB
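To put these numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a sketch based on the sizes listed above; the ~600MB figure is approximate, and the even split across the four sub-fields is an assumption, not something I measured.

```python
# Index sizes in MB, taken from the measurements above.
es2_mb = 350
es5_multi_mb = 1800   # ES5 with the four multi-fields under "file"
es5_plain_mb = 600    # ES5 with the multi-fields removed (approximate)

growth_multi = es5_multi_mb / es2_mb   # overall growth, ES2 -> ES5 with multi-fields
growth_plain = es5_plain_mb / es2_mb   # growth, ES2 -> ES5 without multi-fields
multi_field_cost_mb = es5_multi_mb - es5_plain_mb

# Assumption: the extra space is spread evenly over the four sub-fields
# (fileGrams, fileEn, fileLang1, fileLang2). This has not been measured.
sub_fields = 4
per_sub_field_mb = multi_field_cost_mb / sub_fields

print(f"ES5 with multi-fields vs ES2: {growth_multi:.1f}x")
print(f"ES5 without multi-fields vs ES2: {growth_plain:.1f}x")
print(f"Space attributable to multi-fields: {multi_field_cost_mb} MB")
print(f"Average per sub-field: {per_sub_field_mb:.0f} MB")
```

So the multi-fields account for roughly two thirds of the ES5 index, yet even without them the ES5 index is noticeably larger than the ES2 one, which still had the four copy_to target fields.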
Any ideas or explanations as to what the reason(s) might be?