Hello guys.
I'm making a plugin for Kibana in which I want to store executables in Elasticsearch. (I know it sounds weird, but trust me, there's a good reason for it.)
I've developed the client-side code that reads the file (up to about 100MB), breaks it into smaller chunks (128KB at the moment, configurable), and uploads them to Kibana's server side. Kibana's server side then encodes each chunk in base64 and indexes it into Elasticsearch in a particular field of a particular index.
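To make the chunking step concrete, here is a minimal sketch of it. This is not the actual plugin code: the names `splitIntoChunks` and `base64Length` are made up for illustration, and the chunk size is assumed to be exactly 128 * 1024 bytes.

```typescript
// Hypothetical helpers; CHUNK_SIZE mirrors the 128KB default mentioned above.
const CHUNK_SIZE = 128 * 1024;

interface Chunk {
  chunkNo: number;
  bytes: Uint8Array;
}

// Split a file's bytes into fixed-size chunks; the last chunk may be shorter.
function splitIntoChunks(file: Uint8Array, chunkSize: number = CHUNK_SIZE): Chunk[] {
  const chunks: Chunk[] = [];
  for (let offset = 0, chunkNo = 0; offset < file.length; offset += chunkSize, chunkNo++) {
    // subarray creates a view on the same memory, so no bytes are copied here.
    chunks.push({ chunkNo, bytes: file.subarray(offset, offset + chunkSize) });
  }
  return chunks;
}

// Length of the padded base64 text for `byteLength` raw bytes:
// every group of up to 3 bytes becomes 4 characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// A 300KB dummy file yields two full chunks and one shorter trailing chunk.
const demoChunks = splitIntoChunks(new Uint8Array(300 * 1024));
const demoSizes = demoChunks.map((chunk) => chunk.bytes.length);
```

Under these assumptions each encoded chunk is about a third larger than the raw chunk, so the request bodies sent to Elasticsearch are noticeably bigger than the configured chunk size.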
I'm facing really slow upload (indexing) times for files larger than about 25MB, and I'm having trouble finding out where the bottleneck is.
The mapping for the field with data looks as follows:
    "filename": { "type": "keyword" },
    "data": {
      "enabled": false,
      "properties": {
        "data": { "type": "binary" }, // <-- 160KB after base64
        "chunkNo": { "type": "integer" }
      }
    }
so that it ends up like this in ES:
file._source
    {
      "name": "metricbeat.exe",
      "data": [
        {
          "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
          "chunkNo": 0
        },
        {
          "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
          "chunkNo": 1
        },
        {
          "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
          "chunkNo": 2
        }
      ]
    }
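For anyone reading the document shape above, here is a rough sketch of how the stored chunks could be put back in order before decoding. The interfaces and the `orderedChunks` helper are hypothetical and only mirror the `_source` layout shown; this is not the plugin's real read path.

```typescript
// Shapes mirroring the _source layout shown above (names are illustrative).
interface StoredChunk {
  data: string; // base64 text of one chunk
  chunkNo: number;
}

interface StoredFile {
  name: string;
  data: StoredChunk[];
}

// Return the chunks sorted by chunkNo, throwing if a number is missing or duplicated.
function orderedChunks(file: StoredFile): StoredChunk[] {
  const sorted = file.data.slice().sort((a, b) => a.chunkNo - b.chunkNo);
  sorted.forEach((chunk, index) => {
    if (chunk.chunkNo !== index) {
      throw new Error(`${file.name}: expected chunk ${index}, found ${chunk.chunkNo}`);
    }
  });
  return sorted;
}

// Chunks stored out of order are returned in ascending chunkNo order.
const exampleFile: StoredFile = {
  name: "metricbeat.exe",
  data: [
    { data: "Qg==", chunkNo: 1 },
    { data: "QQ==", chunkNo: 0 },
  ],
};
const orderedNos = orderedChunks(exampleFile).map((chunk) => chunk.chunkNo);
```

Each chunk has to be base64-decoded on its own and the decoded bytes concatenated afterwards, because a chunk whose byte length is not a multiple of 3 ends in padding, so simply joining the base64 strings would not decode correctly.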
Couple of facts:
- Indexing is fast below 10MB (<1s per 128KB chunk)
- It starts to slow down from 10MB to 50MB (1-3s per 128KB chunk)
- It is slowest at 50MB+ (3-6s per 128KB chunk)
- My index mapping has one field that has a "data": { "type": "binary" } mapping.
- I've tried having this field as both object and nested types.
- I've tried setting refresh_interval = -1.
- ES very quickly reaches its Java heap max, and garbage collects very often.
- Garbage collection times (as shown in stdout) are up to about 1.5 seconds.
- I have enabled = false set for the binary field where the large base64 string is saved.
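For clarity, this is roughly what the refresh_interval change from the list above looks like as a request to the update index settings API. It is only a sketch: the index name `my-files` and the helper `refreshIntervalRequest` are placeholders, and the object is just a description of the request, not a call into any particular client library.

```typescript
// Plain description of an update-index-settings request (illustrative type).
interface SettingsRequest {
  method: "PUT";
  path: string;
  body: { index: { refresh_interval: string } };
}

// Build the request; an interval of "-1" disables periodic refresh.
function refreshIntervalRequest(indexName: string, interval: string): SettingsRequest {
  return {
    method: "PUT",
    path: `/${encodeURIComponent(indexName)}/_settings`,
    body: { index: { refresh_interval: interval } },
  };
}

// Disable refresh before the upload, then set it back to 1s afterwards.
const disableRefresh = refreshIntervalRequest("my-files", "-1");
const restoreRefresh = refreshIntervalRequest("my-files", "1s");
```

The second request is there because the setting persists on the index until it is changed again.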
I don't need any sort of indexing or searching capability whatsoever. In all of my other calls to this index I use _source_excludes=['datafield'] to ignore this data until I need it. Reading the data doesn't seem to take too long, either.
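As an illustration of the _source_excludes usage, here is a small sketch that builds a search path with that query parameter. The helper `searchPathExcluding` and the index name `my-files` are hypothetical; only the `_source_excludes` parameter itself comes from the Elasticsearch search API.

```typescript
// Build a search path that asks Elasticsearch to leave the given fields out of _source.
function searchPathExcluding(indexName: string, excludedFields: string[]): string {
  const fields = excludedFields.map((field) => encodeURIComponent(field)).join(",");
  return `/${encodeURIComponent(indexName)}/_search?_source_excludes=${fields}`;
}

// Listing files without pulling the base64 chunks back over the wire.
const listingPath = searchPathExcluding("my-files", ["data"]);
```

With the field excluded, the hits carry only the small metadata fields such as the file name.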
What settings can I set on the index or on the indexing API call to optimize this? Should I be preparing the data differently on Kibana's server side?
Thanks!