Using ES as a distributed datastore to only store binary data (mainly JPG, PNG, SVG), basically replacing our use of GlusterFs

Stephane_Bastian · March 13, 2013, 8:40am

Hello All,

I know the idea of replacing GlusterFs with ES may sound funny... or even
plain silly, but let me give you some background first.

We've been using ES for almost 2 years now and are extremely pleased with
the result. We first started to use it for its search capability, and soon
realized that we could also use it as our main data store. So we stopped
using mongoDb and have been relying solely on ES to store and search. This
is working fine. Performance is excellent, the Api is great and on top of
that operating ES on a production server is a pleasure (unlike a lot of
products out there).

Now, our application allows users to save pictures (JPG, PNG, SVG, etc.).
Each time a picture is saved, a background thread creates several versions
of the picture (mobile, tablet, desktop low-res, desktop high-res). This
way, we can easily send a picture that's optimized for the user's device.

We have done some experiments with GlusterFs and it is working
fine. However, we would like to keep operations light and smooth and would
prefer to keep our technology stack to a minimum. This is how we came
up with the idea of also using ES to store binary data.

However we do not have any experience storing binary data with ES and know
that this is not what it was meant to do best.

The typical use case would be :

Storing binary data
Getting binary data by id
No search, query or facets

Pros:

We would keep the same technology stack, easing our operations (upgrading
to new versions, monitoring, etc...)
We would use the distributed nature of ES and some of its functionality
for this particular use case (number of replica, etc) that on paper look
like a good fit.

Cons:

ES was clearly not designed to handle this specific use case.

Has anyone on this list done something similar or related?
What do you guys think? Do you think ES would be a good fit even if it's
not its main strength?

Thanks in advance for your feedback!

Stephane

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · March 13, 2013, 8:54am

Just some notes about this.

1st: I use ES as a database in my scrutmydocs.org project for storing binaries (and provide search on top of it), so my use case is similar to yours. That said, I'm not sure whether anyone uses it in production (I have heard that some would like to, but I don't have any feedback yet).
2nd: What Elasticsearch is currently missing is the ability to get a binary document with a single REST call like this: http://localhost:9200/index/type/1/_attachment/myfile.png
But you can handle it on the client side by getting the doc at http://localhost:9200/index/type/1 and decoding its BASE64 content.
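
A minimal sketch of that client-side decoding, in Python. The field name "content" and the simulated response are assumptions made for illustration; only the general shape of a get-by-id response (a JSON body with a `_source` object) is taken from Elasticsearch itself.

```python
import base64
import json


def extract_binary(get_response_body: str, field: str = "content") -> bytes:
    """Decode a Base64 field found in the _source of a get-by-id response.

    `field` is a hypothetical field name; use whatever name your mapping defines.
    """
    doc = json.loads(get_response_body)
    return base64.b64decode(doc["_source"][field])


# Simulated response body, shaped like the JSON returned for /index/type/1.
png_magic = b"\x89PNG\r\n\x1a\n"
fake_response = json.dumps({
    "_index": "index",
    "_type": "type",
    "_id": "1",
    "_source": {"content": base64.b64encode(png_magic).decode("ascii")},
})

image_bytes = extract_binary(fake_response)
```

In a real client the response body would come from an HTTP GET against the cluster; the decoding step stays the same.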

I think we will probably provide something similar to this in a future version, which should simplify your life somewhat!

The best thing you can do for now is to test it. I don't see any ugly issues to worry about, as long as you don't store gigabytes per document, which would probably require some tuning!

HTH

David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Stephane_Bastian · March 13, 2013, 2:05pm

Hello David,

Thanks.
I believe scrutMydocs.org uses attachments handled by Tika behind the scenes
at index time. In our case, we had planned to define a binary field instead
of an attachment.
In the end, it probably doesn't make much difference and is equivalent to
using an attachment (except that the content of the binary field is not
indexed and therefore not searchable, which is fine for images).
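
As a rough illustration of the binary-field approach, the sketch below builds a mapping body and a matching document in Python. The type name "picture" and the field names "format" and "data" are hypothetical; the `"type": "binary"` field type is the Elasticsearch feature being discussed, and the `string` / `not_analyzed` form shown for the other field is the 2013-era syntax (later versions use `keyword`).

```python
import base64
import json

# Hypothetical type and field names, chosen only for this example.
mapping = {
    "picture": {
        "properties": {
            "format": {"type": "string", "index": "not_analyzed"},
            # Binary fields hold a Base64 string and are not searchable.
            "data": {"type": "binary"},
        }
    }
}


def make_document(raw: bytes, fmt: str) -> dict:
    """Wrap raw image bytes in a JSON-serializable document."""
    return {"format": fmt, "data": base64.b64encode(raw).decode("ascii")}


svg = b"<svg xmlns='http://www.w3.org/2000/svg'/>"
doc = make_document(svg, "svg")

mapping_body = json.dumps(mapping)  # body for a put-mapping request
doc_body = json.dumps(doc)          # body for an index request
```

Note that Base64 encoding makes the stored string roughly a third larger than the raw bytes, which is worth including in any sizing estimate.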

In terms of size, images would not be too big. Let's say that high-resolution
pictures would be a couple of MB. Mobile images should be a couple of KB.

I will do some testing and report back to the mailing list.

In the meantime, if someone has done something similar, please don't
hesitate to share your experience, pros, cons and such.

Thanks

Stephane

Lukas_Vlcek1 · March 13, 2013, 3:19pm

Hi,

this might be an interesting use case; I would love to see some of your
benchmarks.

From what I have read on the net (I do not consider myself a Lucene
expert), there might be some concerns though. If binary data stored in
Lucene needs to be returned as part of the response, doesn't that mean the
document's binary data has to be loaded into memory first? This could mean
that the heap is subject to much more frequent GC, and it could also
lead to Lucene caches being rebuilt.

Depending on the specific use case and your hardware resources, this might
not be a big problem, but you probably cannot tell without detailed testing.

I would like to hear from Lucene experts on this topic as well.

Regards,
Lukas

brian_yoder · March 13, 2013, 6:29pm

The following use case looks good enough to be worth the time to research
and benchmark:

The typical use case would be :

Storing binary data

Getting binary data by id

No search, query or facets

You would probably want to put this into its own index, and then
cross-reference the id in some other index that also contained any image
metadata (names, geo-coordinates, and so on) that is associated with the
image.
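
A small sketch of that two-index layout in Python. The index names "images_binary" and "images_meta", the field names, and the use of a SHA-1 digest as the shared id are all assumptions for illustration; the thread does not prescribe any of them.

```python
import base64
import hashlib


def build_documents(raw: bytes, name: str, lat: float, lon: float):
    """Return (binary_doc, metadata_doc) that share the same id.

    The binary index holds only the payload; the metadata index holds the
    searchable fields plus a reference back to the binary document.
    """
    image_id = hashlib.sha1(raw).hexdigest()  # hypothetical id scheme
    binary_doc = {
        "index": "images_binary",
        "id": image_id,
        "body": {"data": base64.b64encode(raw).decode("ascii")},
    }
    metadata_doc = {
        "index": "images_meta",
        "id": image_id,
        "body": {
            "name": name,
            "location": {"lat": lat, "lon": lon},
            "binary_id": image_id,
        },
    }
    return binary_doc, metadata_doc


binary_doc, metadata_doc = build_documents(b"fake image bytes", "harbour.jpg", -33.8775, 151.2027)
```

Searches would then run only against the metadata index, and the binary index would be hit with a get-by-id once a result is chosen.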

David_G_Ortega · May 23, 2013, 1:00am

Hi guys, I'm working on this...

Basically it's an imaging server (or a plugin on top of ES) called imagenii.
Its abilities are:

image replication
image transformations on the fly (crop, resize, fill, filters, etc.),
currently around 150 filters available
JSON, JPG, PNG and GIF supported

and of course the search.

From this request body:
{"url":"http://myimage.jpg"}

the document is inflated into the following, plus much more that is not
shown, such as dominant color extraction and scene identification:
http://localhost:9200/_imagenii/imagenii/4.json?pretty=true
{
  _index: "imagenii",
  _type: "data",
  _id: "4",
  _version: 3,
  exists: true,
  _source: {
    type: "photo",
    date: "2010:03:28 22:22:22",
    width: 1600,
    format: "jpeg",
    height: 1200,
    orientation: "landscape",
    hash: "1101000001111010011100000111100001100000111111100010011001111100",
    metadata: {
      Exif Thumbnail: {
        Orientation: "Top, left side (Horizontal / normal)",
        X Resolution: "72 dots per inch",
        Thumbnail Offset: "702 bytes",
        Thumbnail Length: "6738 bytes",
        Resolution Unit: "Inch",
        Thumbnail Compression: "JPEG (old-style)",
        Y Resolution: "72 dots per inch"
      },
      Exif SubIFD: {
        F-Number: "F2,8",
        Aperture Value: "F2,8",
        Date/Time Original: "2010:03:28 22:22:22",
        Metering Mode: "Average",
        Color Space: "sRGB",
        Exposure Mode: "Auto exposure",
        Exif Version: "2.21",
        Exif Image Width: "1600 pixels",
        Components Configuration: "YCbCr",
        Sensing Method: "One-chip color area sensor",
        FlashPix Version: "1.00",
        Flash: "Flash did not fire",
        Date/Time Digitized: "2010:03:28 22:22:22",
        Exif Image Height: "1200 pixels",
        Exposure Program: "Program normal",
        White Balance Mode: "Auto white balance"
      },
      Jpeg: {
        Component 3: "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
        Number of Components: "3",
        Image Height: "1200 pixels",
        Data Precision: "8 bits",
        Compression Type: "Baseline",
        Image Width: "1600 pixels",
        Component 1: "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
        Component 2: "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"
      },
      GPS: {
        GPS Latitude Ref: "S",
        GPS Time-Stamp: "22:22:16 UTC",
        GPS Longitude: "151.0° 12.0' 9.599999999970805"",
        GPS Longitude Ref: "E",
        GPS Latitude: "-33.0° 52.0' 38.999999999991815""
      },
      Exif IFD0: {
        Software: "3.1.2",
        Date/Time: "2010:03:28 22:22:22",
        Orientation: "Top, left side (Horizontal / normal)",
        Model: "iPhone 3G",
        X Resolution: "72 dots per inch",
        YCbCr Positioning: "Center of pixel array",
        Resolution Unit: "Inch",
        Y Resolution: "72 dots per inch",
        Make: "Apple"
      }
    },
    binary: @binary@,
    url: "",
    geos: [
      {
        lng: "151,2026666667",
        lat: "-33,8775000000"
      }
    ]
  }
}

I have written the transformations in a chainable, JS-like way, e.g.:
http://localhost:9200/_imagenii/imagenii/4.jpg?chain=crop(w=150).grayscale()
http://localhost:9200/_imagenii/imagenii/4.png?chain=crop(w=150, h=100, pos=center).blur().solarize()
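
To make the chain syntax concrete, here is a hypothetical parser in Python that splits a `chain` value into ordered steps. It is only a sketch of how such a query parameter could be interpreted; it is not imagenii's actual parser, and it assumes parameter values never contain a closing parenthesis.

```python
import re

# One call looks like name(key=value, key=value); calls are joined by dots.
_CALL = re.compile(r"(\w+)\(([^)]*)\)")


def parse_chain(chain: str):
    """Split a chain expression into a list of (name, params) steps."""
    steps = []
    for name, raw_args in _CALL.findall(chain):
        params = {}
        for pair in raw_args.split(","):
            if pair.strip():
                key, _, value = pair.partition("=")
                params[key.strip()] = value.strip()
        steps.append((name, params))
    return steps


steps = parse_chain("crop(w=150, h=100, pos=center).blur().solarize()")
# steps == [("crop", {"w": "150", "h": "100", "pos": "center"}),
#           ("blur", {}), ("solarize", {})]
```

Parameter values are kept as strings here; converting them to numbers would be up to each transformation.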

It does some clever stuff using NLP, machine learning and computer vision.

To name a few: scene identification, extracting geolocations using GPS and/or
NLP NER on the contained metadata, extracting relevant metadata, extracting
dominant colors, visual search based on several descriptors (CEDD, CCEDD...),
duplicate reduction using a dhash as id... Maybe SIFT in the future if I have
proper ROI.
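
For readers unfamiliar with the dhash idea mentioned above, the following Python sketch shows the commonly described difference hash: each bit records whether a pixel is brighter than its right-hand neighbour. It assumes the image has already been reduced to 8 rows of 9 grayscale values (the resizing step is omitted), and it is not claimed to match imagenii's implementation.

```python
def dhash_bits(gray_rows):
    """Return a 64-character bit string from 8 rows of 9 grayscale values."""
    bits = []
    for row in gray_rows:
        # Compare each pixel with its right-hand neighbour: 8 bits per row.
        for left, right in zip(row, row[1:]):
            bits.append("1" if left > right else "0")
    return "".join(bits)


def hamming(a: str, b: str) -> int:
    """Count differing bit positions; a small distance suggests near-duplicates."""
    return sum(x != y for x, y in zip(a, b))


darkening = [[8, 7, 6, 5, 4, 3, 2, 1, 0] for _ in range(8)]
brightening = [[0, 1, 2, 3, 4, 5, 6, 7, 8] for _ in range(8)]

hash_a = dhash_bits(darkening)
hash_b = dhash_bits(brightening)
distance = hamming(hash_a, hash_b)
```

Using such a hash as the document id means that re-posting an identical image overwrites the existing document instead of creating a duplicate.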

Basically I think you can do anything... A new TinEye, the new Instagram, a
search engine by shape, color or texture for e-commerce, a CDN for images
with imaging and search capabilities...

I have done a lot. As David said, a few things still have to be done in ES
to support everything on my todo list. For me, discovering the
missing headers in the REST response was a heartbreaker ;). I'm waiting for
the next release to continue with the proper headers.

About performance? Well, ES is quite slow when hitting big files; dealing
with the binary is not a piece of cake, and I'm dealing with a lot of
PermGen glitches.

When dealing with small images (under 600px), response time is under 10 ms
under very heavy stress (500 users with a ramp-up of 1 s per 30 s), but
basically only because I'm also using ES as a cache, so the transformations
only occur once... Dealing with big files is another story: the same stress
test gives me a median of 8-9 seconds. Maybe that's pretty normal even with
more conventional servers like Apache; I have to compare, but at the moment
I am still tightening everything up a bit...

Let's see. I'm finishing the project between this month and the next one
and planning the ROI, so I presume that at the end of the month I will have
something live to be tested and tasted.

Best

Stephane_Bastian · May 23, 2013, 6:36am

Hey David,

This is really great !

Quick question:
I guess part of your code is not ES specific (image crop, resize and such).
When you said that it can be slow with big files, did you get a chance to
run the image transformations outside of ES to know if it's slow because of ES?

Can't wait to test and taste it!

All the best,

David_G_Ortega · May 23, 2013, 7:20am

Hi Stephane,

The numbers I gave were for fetching only the two fields that I need,
"binary" and "type". The transformations are not taken into account, since
I'm using a cache layer to avoid repeating them. The transformations
themselves are not very expensive, apart from creating a BufferedImage,
which is very slow and painful for large images.

The process is:

-> request the object fields "binary" and "format"
-> if the object exists and transformations are needed, we transform
-> the transform step checks its cache: if the result is there, just return it; if not, transform
-> if no transformations are needed, just return the binary
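
The flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not imagenii's code: the cache here is an in-memory dict with a TTL (imagenii is described as using ES itself as the cache), and `serve`, `TransformCache` and the `transform` callback are hypothetical names.

```python
import time


class TransformCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped lazily
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


def serve(doc, chain, transform, cache):
    """Return the bytes to send for a stored doc and an optional chain."""
    if doc is None:
        return None                  # object does not exist
    if not chain:
        return doc["binary"]         # no transformation requested
    key = (doc["id"], chain)
    cached = cache.get(key)
    if cached is not None:
        return cached                # transformation already done
    result = transform(doc["binary"], chain)
    cache.put(key, result)
    return result
```

With this shape, the expensive step (decoding into an image and transforming it) runs at most once per distinct id and chain within the TTL window.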

So http://localhost:9200/_imagenii/imagenii/4.jpg
returns the original image as posted, and
http://localhost:9200/_imagenii/imagenii/4.png
returns a PNG version of the image. To do this I have to create a
BufferedImage from the binary, first converting the color space so as not to
overload the BufferedImage (which does not perform well with non-sRGB color
spaces), and then write it back to a byte array. This only happens if I don't
already have it in the cache, in which case I simply return the cached copy.

The cache layer is governed by a ttl parameter: when you store the
transformed image in the cache, it lives for the given ttl, and of
course the response carries the proper expiration time so that the browser
does not request it again (like Amazon does, for example).

With imagenii you create thumbnails on steroids: no need to maintain them,
since the ttl does it for you (unused thumbnails just die), and you can
change your template without having to recreate all the thumbnails;
just do it on the fly.

I also think that ES is missing an expiration header by default. The get
operation should at least have a Last-Modified header, so that a client can
just do a HEAD and see whether the doc has changed in the meantime before
downloading it. That would save a lot of network traffic and give faster
responses from ES, since Netty would not have to return the whole doc...
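
As an illustration of the header logic being asked for, the Python sketch below decides between a full response and a 304 based on standard HTTP `Last-Modified` / `If-Modified-Since` semantics. It is not an Elasticsearch API; `conditional_response` and its parameters are hypothetical, and the document's modification time is assumed to be tracked by the caller.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since, max_age_seconds):
    """Return (status, headers) for a GET or HEAD on a stored document.

    last_modified must be a timezone-aware UTC datetime.
    """
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": f"max-age={max_age_seconds}",
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers
    return 200, headers


modified = datetime(2013, 5, 23, 7, 20, 0, tzinfo=timezone.utc)
status, headers = conditional_response(modified, None, 3600)
```

A client that already holds the document sends the `Last-Modified` value back as `If-Modified-Since` and receives a bodyless 304 if nothing changed.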

Best

jprante · May 23, 2013, 8:00pm

Wow impressive!

Just a few questions: have you considered testing with Java 8? There is no
more PermGen in Java 8:
http://mail.openjdk.java.net/pipermail/hotspot-dev/2012-September/006679.html

And do you plan to separate binary from image metadata in the ES index?
Would be nice, also for retrieval performance. Maybe with a custom
Lucene 4 codec?

Are you thinking about a streaming API, maybe with websockets? I am also
interested in such an ES extension, which could add persistent connections
to move data chunks around, including binary chunks, not only JSON.

Jörg

On 23.05.13 03:00, David G Ortega wrote:

About perfomance? Well, ES is quite slow when hitting big files,
dealing with the binary is not a piece of cake and Im dealing with a
lot of permagen glitches.

Lukas_Vlcek1 · May 23, 2013, 9:24pm

Hi David,

I just wanted to ask: can you remind me why you want to store
large binary files in a Lucene index? Did you consider other options? I would
assume that a fast, distributed store with replication for binary data (like
images) is a common requirement for many large web sites, and there are
already proven solutions.

When storing in Elasticsearch, did you consider that all that data needs
to be moved across the cluster when shards are relocated between nodes?
Apart from the Java GC issue already mentioned, this can put additional load
on your cluster, eating resources needed for fast search, IMO.

Maybe it is a valid use case; I am just thinking out loud...

Regards,
Lukas

David_G_Ortega · May 25, 2013, 3:41pm

Hi guys,

@Jörg Prante
"And do you plan to separate binary from image metadata in the ES index?"

No, both are together, but you can specify the fields in the search exactly
as you would in ES, and by default the binary field is not included in the
search response.

The Java 8 suggestion is cool. To be fair, I have no idea: the code I put on
top (the imaging, image descriptors, k-means, etc.) is supported by Java 8,
but of course the whole system relies on ES, so this depends much more on ES
than on imagenii.

I have been thinking about streaming for huge images, but I think imagenii
will end up as an imaging search server centered on replication, imaging, and
any kind of search.

@Lukas
"Do you think you can remind me why you want to store large binary files in
Lucene index?"

I think ES was designed with storing files in mind... a search engine is
basically a bunch of docs with an inverted index, so whether it's an image or
an HTML doc does not matter. Keep in mind that the binary is not indexed, so
the inverted index should be minimal. Lucene is also much smarter at deleting
a huge index than your OS is at deleting a huge directory of files (at least
Windows). And my KISS mind refuses to put too many technologies in the
equation...

Sharding to me is a bad idea... My first and only approach with imagenii is
to have no sharding (a single shard); the index is defined with
"auto_expand_replicas" : "0-all",
Sharding should be the last approach; only shard when you have no other
options... At the moment I index 50GB of images per day (I'm developing a
classifieds search engine as well), and I can say that the solution is fast
enough and it replicates my images across the nodes, so I don't have to use
NFS, a NAS, or a dedicated server. Apart from the metadata, the index data for
every image is minimal: minimal, but a really, really powerful BOVW.
My portal uses imagenii to serve the thumbs and to search by visual
similarity to remove duplicated posts (spam bots).
The average search is under 200ms. I was using an in-house distributed
BK-tree, but ES is fast enough.
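
As a rough illustration of the index definition mentioned above, the settings body might look like the sketch below. Only the `auto_expand_replicas` value comes from the post; `number_of_shards: 1` is an assumption about what "no sharding" means here, and the helper `expected_replicas` is a hypothetical name used to show the effect of the `0-all` range.

```python
import json

# Settings body for index creation. Only "auto_expand_replicas" is taken
# from the post; the single primary shard is an assumption.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,            # assumption: one primary shard, no sharding
            "auto_expand_replicas": "0-all",  # replicas grow and shrink with the cluster
        }
    }
}


def expected_replicas(data_node_count: int) -> int:
    """With "0-all", every data node holds one copy: one primary plus n - 1 replicas."""
    return max(0, data_node_count - 1)


print(json.dumps(index_settings, indent=2))
```

With these settings, adding a node means the full set of images is copied to it, which is the replication behaviour the post relies on instead of NFS or a NAS. The trade-off is that every node must have room for the whole index.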


jprante · May 27, 2013, 10:52pm

Lucene's doc deletion is very expensive and can't compete with filesystem
deletion. For storing large data, Lucene's document store codec works well
as long as you don't start random deletes here and there. Lucene will move
the whole index for that sooner or later although only one doc was
deleted...

With FlexibleIndexing (see the Apache Lucene wiki) you could try to
implement a custom codec that can take better care of binary files in
stored fields, especially for already compressed image data. Maybe a COW or
append-only codec which never moves data chunks or renumbers docs.

As you can see in this blog post
http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
there are new possibilities to manage the I/O of reading/writing binary
data. Note, in Lucene 4, compression of stored fields is now the default,
which may not always match well with compressed image data.
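
The point about already-compressed data can be checked with a small experiment. This sketch uses Python's zlib as a stand-in (Lucene's stored-fields compression uses LZ4, not zlib) and pseudo-random bytes as a stand-in for JPG/PNG payloads; both substitutions are assumptions made purely for illustration.

```python
import random
import zlib

rng = random.Random(0)
# Stand-in for already-compressed image data: high-entropy bytes.
image_like = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
# Stand-in for ordinary stored fields: repetitive text.
text_like = b"title: classified ad, category: cars, city: Madrid\n" * 1024


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


print(f"text-like ratio:  {ratio(text_like):.3f}")
print(f"image-like ratio: {ratio(image_like):.3f}")
```

The text-like payload shrinks dramatically while the image-like payload does not shrink at all, so compressing stored image fields mostly costs CPU without saving disk space.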

Jörg



David_G_Ortega · May 27, 2013, 11:49pm

Awesome info Jörg!!

I'm having a look... Thanks!!

Let's see how imagenii works with the latest version. At the moment it is
working perfectly with 0.19, but 0.90 was giving me a class-not-found
exception. I have to look at it.

I consider Lucene much smarter at deleting since the deletion is done in the
background (maybe I'm wrong). Anyway, the data has to be in the index since
I'm letting ES replicate the data. That's one of my specs: images are
replicated across nodes, so you never need NFS or any distributed FS
to have the images replicated. An imagenii cluster with an anycast IP and you
have a CDN with imaging and search, including visual search. I think that's
pretty cool.

Imagenii key features:

1 - Image replication across the nodes; with an anycast IP you have a CDN
out of the box.
2 - An imaging server that lets you do image transformations, designed with a
chainable developer API in mind. Crop, resize, rotation, flip, canvas,
brightness, contrast, color, sharp, crisp, blur, solarize... Up to 150
effects and filters available.
3 - Data augmentation with geo (through metadata using GPS or NER), date,
author, camera details, bounding-box tags (for face detection or manually
posted bboxes), color extraction, feature extraction, BOVW... so you can
search by text or image BOVW, or directly against an image for visual search.

No more, no less


