Indexing (storing) large binary objects

Hello guys.

I'm making a plugin for Kibana, in which I want to store executables in Elasticsearch. (I know it sounds weird; trust me, there's a good reason for it.)

I've developed the client-side code that reads the file (up to about 100MB), breaks it into smaller chunks (128KB at the moment, configurable, obviously), and uploads them to Kibana's server side. Kibana's server side then encodes each chunk as base64 and indexes it into Elasticsearch in a particular field of a particular index.
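For reference, the server-side chunk-and-encode step looks roughly like this (a simplified sketch, not the actual plugin code; the function and constant names are just illustrative):

// Minimal sketch of the chunk-and-encode step on the Kibana server side.
const CHUNK_SIZE = 128 * 1024; // 128KB, configurable

function toBase64Chunks(file: Buffer): Array<{ chunkNo: number; data: string }> {
  const chunks: Array<{ chunkNo: number; data: string }> = [];
  for (let offset = 0, chunkNo = 0; offset < file.length; offset += CHUNK_SIZE, chunkNo++) {
    chunks.push({
      chunkNo,
      // toString('base64') inflates each 128KB chunk by roughly a third.
      data: file.subarray(offset, offset + CHUNK_SIZE).toString('base64'),
    });
  }
  return chunks;
}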

I'm facing really slow upload (indexing) times for files larger than about 25MB, and I'm having trouble finding out where the bottleneck is.

The mapping for the field holding the data looks as follows:

"filename": { "type": "keyword"},
"data": {
  "enabled": false,
  "properties":  {
    "data": {"type": "binary"}, //<-- 160KB after base64
    "chunkNo":  {"type": "integer"}
  }
}

so that it ends up like this in ES:

file._source
{
    "name": "metricbeat.exe",
    "data": [
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 0
        },
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 1
        },
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 2
        }
    ]
}

Couple of facts:

  • Indexing is fast below 10MB (<1s per 128KB chunk)
  • It starts to slow down from 10MB to 50MB (1-3s per 128KB chunk)
  • It is slowest at 50MB+ (3-6s per 128KB chunk)
  • My index mapping has one field with a "data": { "type": "binary" } mapping.
  • I've tried having this field as both object and nested type.
  • I've tried setting refresh_interval = -1 (see the settings sketch after this list)
  • ES very quickly reaches its Java heap max and garbage collects very often.
  • Garbage collection times (as shown in stdout) are up to about 1.5 seconds.
  • I have enabled = false set for the binary field where the large base64 string is saved.
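To be concrete about the refresh_interval bit, this is roughly what I toggle around an upload, sketched against the @elastic/elasticsearch 8.x JavaScript client (the index name and the "restore" values are placeholders, not necessarily what you'd want):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Disable refresh (and replicas) for the duration of the upload, restore afterwards.
async function setBulkUploadSettings(indexName: string, uploading: boolean) {
  await client.indices.putSettings({
    index: indexName,
    settings: {
      refresh_interval: uploading ? '-1' : '1s',
      number_of_replicas: uploading ? 0 : 1,
    },
  });
}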

I don't need any sort of indexing or searching capability whatsoever. In all of my other calls to this index I use _source_excludes=['datafield'] to ignore this data, until I need it. Reading data doesn't seem to take too long, either.
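For example, the metadata-only reads look roughly like this (again a sketch with the 8.x @elastic/elasticsearch client; index and field names are placeholders):

import type { Client } from '@elastic/elasticsearch';

// Metadata-only reads: exclude the heavy base64 field from _source.
async function listFiles(client: Client, indexName: string) {
  const response = await client.search({
    index: indexName,
    _source_excludes: ['data'],
    query: { match_all: {} },
  });
  return response.hits.hits;
}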

What settings can I set on the index or on the indexing API call to optimize this? Should I be preparing the data differently on Kibana's server side?

Thanks!

Can't you send each chunk as a separate document?

Isn't that the purpose of the nested type, but with a little syntactic sugar? I've tried nested and gotten the same results.

Do you think manually saving separate documents and linking the IDs of the corresponding chunks into the "main" document (effectively implementing nested documents in the Kibana server plugin code) would have a different effect?

Separate documents would be better, as all nested documents are reindexed whenever you make any modification, and this gets slower as they get larger and larger. Indexing chunks as separate documents will scale and perform a lot better than nested documents. I would, however, still recommend storing this type of content outside Elasticsearch.
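Roughly, indexing each chunk as its own document could look like this (a sketch only, using the 8.x @elastic/elasticsearch client; the index name, ID scheme, and helper name are illustrative):

import type { Client } from '@elastic/elasticsearch';

// One document per chunk: no nested array to grow, so each chunk indexes independently.
async function indexChunksSeparately(
  client: Client,
  filename: string,
  chunks: Array<{ chunkNo: number; data: string }>
) {
  const operations = chunks.flatMap((chunk) => [
    { index: { _index: 'file-chunks', _id: `${filename}-${chunk.chunkNo}` } },
    { filename, chunkNo: chunk.chunkNo, data: chunk.data },
  ]);
  await client.bulk({ operations, refresh: false });
}

// To reassemble, fetch the documents for a filename sorted by chunkNo and concatenate the decoded buffers.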

Yeah. I'm also concerned about the HTTP layer. Sending 100MB of binary (which will end up being more than 100MB as base64 text) won't work by default. Also, the heap needed to hold the full document in memory is much bigger.

Then search, replication, ... will require a lot of disk reads and network usage.

My 2 cents.

That being said, why use Elasticsearch in that case?

I'll give fully separate documents a shot after this branch, thanks for the input.

I know, I know. I'm building an application (plugin) that pairs with a Golang agent to do some service enumeration on an endpoint, bring the results into Kibana, and have Kibana issue the relevant Beats and configs. There will only be a dozen or so binaries stored in ES, and pulls on those binaries won't be constant. It's not a CDN feature, just a bit of supporting architecture. 98% of the value of this plugin is bringing log content from endpoints into ES more easily for a hands-off party; 2% of it requires the capability of delivering binaries.

If I can get this functionality built with any sort of reasonable SLAs without breaking the whole stack, I'll probably use it. More likely, I'll just code in pulling from external sources like S3 buckets. I just wanted to keep it all in-house as a single deployment.

Thanks guys!

wait wait wait....


Spell checker.... sigh

I've decided to go with bucket storage from my cloud provider. The Node APIs are pretty clean, and it's way cheaper and simpler than I thought.
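For anyone finding this later, the upload side with a bucket really is only a few lines, e.g. with the AWS SDK v3 (assuming S3; the bucket name, key, and region are placeholders):

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' }); // region is a placeholder

// Upload the raw binary as-is; no base64 encoding or chunking needed.
async function uploadBinary(key: string, body: Buffer) {
  await s3.send(new PutObjectCommand({ Bucket: 'my-binaries', Key: key, Body: body }));
}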

Not worth

Thanks anyway tho.

