Proposal: binary attachments

I'm considering hacking on an extension to support binary attachments.
Essentially:

PUT /index/type/id/attachment-id
GET /index/type/id/attachment-id
DELETE /index/type/id/attachment-id

Each attachment would be stored as replicated files on the nodes.
(This is for big attachments, e.g. video; small attachments can just
be binhexed.) Attachments would have some arbitrary size limit, e.g. 5
GB, with the expectation that the client would string pieces together
to make longer files as necessary. Attachments would be hashed
separately: they would not necessarily share the same node as the
document they are attached to.
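
To make the size limit and the separate hashing concrete, here is a
minimal client-side sketch. The helper names, the shard count, and the
hash-modulo placement are all hypothetical; this is not ES's actual
routing function, only an illustration of why a document and its
attachments can end up on different nodes.

```python
import hashlib
from typing import Iterator


def split_into_chunks(data: bytes, limit: int) -> Iterator[bytes]:
    """Yield consecutive pieces of data, each at most `limit` bytes."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(data), limit):
        yield data[start:start + limit]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


# A 25-byte "file" with a 10-byte limit stands in for a video and a 5 GB limit.
chunks = list(split_into_chunks(b"x" * 25, limit=10))

# Each attachment id is hashed on its own, independently of the document id,
# so nothing forces the pieces onto the shard that holds the document.
placement = {
    f"attachment-id{i + 1}": shard_for(f"attachment-id{i + 1}", 5)
    for i in range(len(chunks))
}
```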

Initially the client would be responsible for managing uploading,
downloading, and deletion, but there may be conventions discovered
that would eventually be easier to implement on the server.

For example, a convention like

{
  "_attachments": {
    "file1": [ "attachment-id1", offset, "attachment-id2", offset ]
  }
}

might eventually make the ES server do something different with these
blobs, but initially this would all be up to client convention.
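
A rough sketch of how a client might build and use such a manifest.
It assumes each offset is the starting byte of that piece within the
logical file, which is one possible reading of the convention; the
function names are made up for illustration.

```python
from typing import Dict, List, Union

Manifest = List[Union[str, int]]


def build_manifest(chunk_ids: List[str], chunk_sizes: List[int]) -> Manifest:
    """Return a flat [id, offset, id, offset, ...] list."""
    manifest: Manifest = []
    offset = 0
    for chunk_id, size in zip(chunk_ids, chunk_sizes):
        manifest.extend([chunk_id, offset])
        offset += size
    return manifest


def reassemble(manifest: Manifest, blobs: Dict[str, bytes]) -> bytes:
    """Concatenate the pieces in offset order, checking for gaps or overlaps."""
    pairs = sorted(zip(manifest[0::2], manifest[1::2]), key=lambda p: p[1])
    out = bytearray()
    for chunk_id, offset in pairs:
        if offset != len(out):
            raise ValueError(f"gap or overlap before {chunk_id}")
        out += blobs[chunk_id]
    return bytes(out)


blobs = {"attachment-id1": b"abcd", "attachment-id2": b"efg"}
manifest = build_manifest(list(blobs), [len(b) for b in blobs.values()])
document = {"_attachments": {"file1": manifest}}
```

Here `manifest` is `["attachment-id1", 0, "attachment-id2", 4]`, and
`reassemble(manifest, blobs)` gives back the original seven bytes.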

I could use a completely different piece of software (e.g. S3, if I'm
already on AWS, OpenStack Swift, etc.) to handle blobs. This just
feels unnecessary. ES already does 95% of what's necessary to manage
simple blobs in a way few blob managers can match. Many blob managers
are unreasonably slow because they have considerable management
overhead tracking the blobs, whereas I would be finding them using ES
anyway. For example, the OpenStack object store has a serious
bottleneck of 70-100 puts/second/bucket because it has to open,
commit, and close an SQLite database for each blob put.

Who thinks this is an inevitable piece of what ES should do, and who
thinks this is a useless complication given that it can clearly be
done with other software?

Hello Jim,

In the master branch, you can now exclude some fields from _source,
and I've made a pull request to enable compression on binary fields.
This means that you could send attachments along with your document
when indexing it, while having them stored efficiently and retrieved
only if necessary:

Index mapping:

"documents" : {
  "_source" : {
    "excludes" : [ "file" ]
  },
  "properties" : {
    "file" : {
      "type" : "binary",
      "compress" : "true"
    },
    ...

Document to index:

{
  "my_prop" : "stuff",
  "file" : base64(file)
}
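
The `base64(file)` above is shorthand. A minimal sketch of producing
that body on the client side, using only the Python standard library
(the function name `build_document` is made up for illustration):

```python
import base64
import json


def build_document(my_prop: str, file_bytes: bytes) -> str:
    """Return the JSON body to index, with the file base64-encoded."""
    doc = {
        "my_prop": my_prop,
        # JSON cannot carry raw bytes, so the binary field is sent as base64 text.
        "file": base64.b64encode(file_bytes).decode("ascii"),
    }
    return json.dumps(doc)


body = build_document("stuff", b"\x00\x01binary payload")
```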

Then, to retrieve the file, just add it to the list of fields to
retrieve in your query.
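
A hedged sketch of that retrieval step. It assumes the search request
body takes a `fields` list and that each hit comes back with a
`fields` object holding the stored value; the exact response shape is
an assumption and may differ between ES versions, so the helper
accepts either a bare string or a one-element list.

```python
import base64
import json
from typing import Iterable


def build_search_body(field_names: Iterable[str]) -> str:
    """Ask for specific stored fields instead of the whole _source."""
    return json.dumps({"query": {"match_all": {}}, "fields": list(field_names)})


def extract_file(hit: dict, field: str = "file") -> bytes:
    """Decode the base64 field from a single search hit (shape assumed)."""
    value = hit["fields"][field]
    if isinstance(value, list):
        value = value[0]
    return base64.b64decode(value)


request_body = build_search_body(["file"])

# Hand-written sample hit, not real server output.
sample_hit = {"_id": "1", "fields": {"file": base64.b64encode(b"hello").decode("ascii")}}
payload = extract_file(sample_hit)
```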

This is not exactly what you describe, but it has the advantage of
being self-contained and atomic at the document level.

Jérémie

On Wed, Oct 26, 2011 at 4:19 PM, Jim Hurd jimh@datagrove.com wrote:

--
Jérémie 'ahFeel' BORDIER

I think base64 attachments are useful, but more for files on the
order of 10 KB, not 10 GB. Imagine storing and searching for videos
with ES. You don't want to base64-encode them, and you don't want to
overflow the log with them; you want to move them as directly as
possible to a storage location. You want to break them into pieces to
use the cluster efficiently in parallel. Both S3 and OpenStack Swift
allow you to push parts of a large file up in parallel, then tie them
together with a manifest (details differ slightly depending on the
blob manager; I would use an ES document as the manifest). That said,
I realize that ES isn't remotely competitive with either of these
products. I do think this is a logical extension of the base64 binary
type already in ES, so I don't think it's that far outside its
application area.
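
A small sketch of the "push parts in parallel, then tie them together
with a manifest" pattern. An in-memory dictionary stands in for the
blob store, and the manifest layout and part naming are invented for
illustration; neither S3's nor Swift's real API is used here.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryBlobStore:
    """Stand-in for a real blob store; only put/get by id."""

    def __init__(self) -> None:
        self._blobs = {}
        self._lock = threading.Lock()

    def put(self, blob_id: str, data: bytes) -> None:
        with self._lock:
            self._blobs[blob_id] = data

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]


def upload_in_parts(store, name: str, data: bytes, part_size: int, workers: int = 4) -> dict:
    """Upload fixed-size parts concurrently and return a manifest document."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    ids = [f"{name}-part{n}" for n in range(len(parts))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any upload error.
        list(pool.map(store.put, ids, parts))
    return {
        "name": name,
        "size": len(data),
        "parts": ids,  # order here defines the order of reassembly
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def download(store, manifest: dict) -> bytes:
    """Fetch the parts listed in the manifest and verify the checksum."""
    data = b"".join(store.get(part_id) for part_id in manifest["parts"])
    if hashlib.sha1(data).hexdigest() != manifest["sha1"]:
        raise ValueError("checksum mismatch")
    return data
```

The manifest returned by `upload_in_parts` is the piece that would be
indexed as an ordinary ES document, so the parts can be found again by
searching.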

Jim

On Oct 26, 10:27 am, Jérémie BORDIER jeremie.bord...@gmail.com
wrote:
