Indexing (storing) large binary objects

Hello guys.

I'm making a plugin for Kibana, in which I want to store executables in Elasticsearch. (I know it sounds weird; trust me, there's a good reason for it.)

I've developed the client-side code that reads the file (up to about 100MB), breaks it into smaller chunks (128KB at the moment, configurable, obviously), and uploads them to Kibana's server side. Kibana's server side then encodes each chunk as base64 and indexes it into Elasticsearch in a particular field of a particular index.
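For reference, the server-side chunk-and-encode step looks roughly like this (a simplified sketch, not the actual plugin code; the function and constant names are just illustrative):

// Minimal sketch of the chunk-and-encode step on the Kibana server side.
const CHUNK_SIZE = 128 * 1024; // 128KB, configurable

function toBase64Chunks(file: Buffer): Array<{ chunkNo: number; data: string }> {
  const chunks: Array<{ chunkNo: number; data: string }> = [];
  for (let offset = 0, chunkNo = 0; offset < file.length; offset += CHUNK_SIZE, chunkNo++) {
    chunks.push({
      chunkNo,
      // toString('base64') inflates each 128KB chunk by roughly a third.
      data: file.subarray(offset, offset + CHUNK_SIZE).toString('base64'),
    });
  }
  return chunks;
}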

I'm facing really slow upload (indexing) times for files larger than about 25MB, and I'm having trouble finding out where the bottleneck is.

The mapping for the field holding the data looks as follows:

"filename": { "type": "keyword"},
"data": {
  "enabled": false,
  "properties":  {
    "data": {"type": "binary"}, //<-- 160KB after base64
    "chunkNo":  {"type": "integer"}
  }
}

so that it ends up like this in ES:

file._source
{
    "name": "metricbeat.exe",
    "data": [
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 0
        },
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 1
        },
        {
            "data": "BASE64...BASE64......BASE64......BASE64......BASE64...",
            "chunkNo": 2
        }
    ]
}

Couple of facts:

  • Indexing is fast below 10MB (<1s per 128KB chunk)
  • It starts to slow down from 10MB to 50MB (1-3s per 128KB chunk)
  • It is slowest at 50MB+ (3-6s per 128KB chunk)
  • My index mapping has one field with a "data": { "type": "binary" } mapping.
  • I've tried having this field as both object and nested type.
  • I've tried setting refresh_interval = -1 (see the settings sketch after this list)
  • ES very quickly reaches its Java heap max and garbage collects very often.
  • Garbage collection times (as shown in stdout) are up to about 1.5 seconds.
  • I have enabled = false set for the binary field where the large base64 string is saved.
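To be concrete about the refresh_interval bit, this is roughly what I toggle around an upload, sketched against the @elastic/elasticsearch 8.x JavaScript client (the index name and the "restore" values are placeholders, not necessarily what you'd want):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Disable refresh (and replicas) for the duration of the upload, restore afterwards.
async function setBulkUploadSettings(indexName: string, uploading: boolean) {
  await client.indices.putSettings({
    index: indexName,
    settings: {
      refresh_interval: uploading ? '-1' : '1s',
      number_of_replicas: uploading ? 0 : 1,
    },
  });
}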

I don't need any sort of indexing or searching capability whatsoever. In all of my other calls to this index I use _source_excludes=['datafield'] to ignore this data, until I need it. Reading data doesn't seem to take too long, either.
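For example, the metadata-only reads look roughly like this (again a sketch with the 8.x @elastic/elasticsearch client; index and field names are placeholders):

import type { Client } from '@elastic/elasticsearch';

// Metadata-only reads: exclude the heavy base64 field from _source.
async function listFiles(client: Client, indexName: string) {
  const response = await client.search({
    index: indexName,
    _source_excludes: ['data'],
    query: { match_all: {} },
  });
  return response.hits.hits;
}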

What settings can I set on the index or on the indexing API call to optimize this? Should I be preparing the data differently on Kibana's server side?

Thanks!

Can't you send each chunk as a separate document?

Isn't that the purpose of the nested type, but with a little syntactic sugar? I've tried nested and gotten the same results.

Do you think manually saving separate documents and linking the IDs of the corresponding chunks into the "main" document (effectively implementing nested documents in the Kibana server plugin code) would have a different effect?

Separate documents would be better, as all nested documents are reindexed whenever you make any modification, and this gets slower as they get larger and larger. Indexing chunks as separate documents will scale and perform a lot better than nested documents. I would, however, still recommend storing this type of content outside Elasticsearch.
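Roughly, indexing each chunk as its own document could look like this (a sketch only, using the 8.x @elastic/elasticsearch client; the index name, ID scheme, and helper name are illustrative):

import type { Client } from '@elastic/elasticsearch';

// One document per chunk: no nested array to grow, so each chunk indexes independently.
async function indexChunksSeparately(
  client: Client,
  filename: string,
  chunks: Array<{ chunkNo: number; data: string }>
) {
  const operations = chunks.flatMap((chunk) => [
    { index: { _index: 'file-chunks', _id: `${filename}-${chunk.chunkNo}` } },
    { filename, chunkNo: chunk.chunkNo, data: chunk.data },
  ]);
  await client.bulk({ operations, refresh: false });
}

// To reassemble, fetch the documents for a filename sorted by chunkNo and concatenate the decoded buffers.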

Yeah. I'm also concerned about the HTTP layer. Sending 100MB of binary (which will end up being more than 100MB as base64 text) won't work by default. Also, the heap needed to hold the full document in memory is much bigger.

Then search, replication, ... will require a lot of disk reads and network usage.

My 2 cents.

That being said, why use Elasticsearch in that case?

I'll give fully separate documents a shot after this branch, thanks for the input.

I know, I know. I'm building an application (plugin) that pairs with a Golang agent to do some service enumeration on an endpoint, bring the results into Kibana, and have Kibana issue the relevant Beats and configs. There will only be a dozen or so binaries stored in ES, and pulls on those binaries won't be constant. It's not a CDN feature, just a bit of supporting architecture. 98% of the value of this plugin is bringing log content from endpoints into ES more easily for a hands-off party; 2% of it requires the capability of delivering binaries.

If I can get this functionality built with any sort of reasonable SLAs without breaking the whole stack, I'll probably use it. More likely, I'll just code in pulling from external sources like S3 buckets. I just wanted to keep it all in-house as a single deployment.

Thanks guys!

wait wait wait....


Spell checker.... sigh

I've decided to go with bucket storage from my cloud provider. The Node APIs are pretty clean, and it's way cheaper and simpler than I thought.
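For anyone finding this later, the upload side with a bucket really is only a few lines, e.g. with the AWS SDK v3 (assuming S3; the bucket name, key, and region are placeholders):

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' }); // region is a placeholder

// Upload the raw binary as-is; no base64 encoding or chunking needed.
async function uploadBinary(key: string, body: Buffer) {
  await s3.send(new PutObjectCommand({ Bucket: 'my-binaries', Key: key, Body: body }));
}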

Not worth

Thanks anyway tho.

