Dec 2nd, 2021: [en] Generating document thumbnails - now Open Source on the JVM

Sean_Story · December 2, 2021, 8:00am

It's something that I'd always taken for granted. You work with any sort of file system/file browser/search engine and you just expect to see thumbnail images. It makes perfect sense - they say that "a picture is worth 1000 words", and ain't nobody got time to read 1000 words. So thumbnails have become pervasive. You see them in Google Drive, in Dropbox, on your desktop. So if thumbnails are so pervasive, they must be super easy to generate, right?

Wrong.

Oh my gosh, so wrong.

When we started looking to add thumbnails to the Elastic Workplace Search product, our first stop was Stack Overflow. They're usually pretty diligent about closing "duplicate" posts, but we found dozens of variations of folks wanting to do the same "simple" thing - given a file, produce a thumbnail, in Java.

Why so many questions with so many different answers? Well, because it turns out there are a lot of different file formats, even among ones we think of as "the same." Let's consider Word documents built by Microsoft software, shall we? You've got .doc and .docx file extensions. Just two types, right? Wrong again. Those break down into 7 MIME types:

application/x-tika-msoffice
application/msword
application/x-tika-ooxml
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.ms-word.template.macroenabled.12
application/vnd.ms-word.document.macroenabled.12

Ok, so 7 formats. Not that bad.
Except it is. Because that's just how the documents are represented in binary - not how they're displayed. OpenOffice will open your .docx files a little differently from Word, a little differently from Google Drive, a little differently from Dropbox... Maybe now we begin to grasp the problem.
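To make the "two extensions, seven MIME types" point concrete, here is a minimal sketch that treats the seven types listed above as one "Word document" family. The class and method names are hypothetical and exist only for illustration; this is not how any particular library does it, and it assumes something else has already detected the MIME type.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of any library): groups the seven MIME types
// from the list above into a single "Word document" family.
final class WordMimeTypes {
    private static final Set<String> WORD_MIME_TYPES = Set.of(
        "application/x-tika-msoffice",
        "application/msword",
        "application/x-tika-ooxml",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
        "application/vnd.ms-word.template.macroenabled.12",
        "application/vnd.ms-word.document.macroenabled.12");

    private WordMimeTypes() {}

    // Returns true when the given MIME type is one of the seven listed above.
    static boolean isWordDocument(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=..." and normalise the case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return WORD_MIME_TYPES.contains(base);
    }
}
```

Knowing the type only tells you which parser to reach for. It does nothing for the rendering differences described above.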

As to the solutions, there are a lot out there. PDFTron, Aspose, and Sferyx all sell products that can help here. JODConverter and documents4j both require network connections to other software services (Apache OpenOffice, LibreOffice, or Microsoft Office), the hosting of which eats into "free". And while there are a few libraries that don't require payment or hosted services (like iText and XDocReport), their licenses restrict you from using them in many instances.

We never did find a true open-source solution for generating thumbnails. So we built one. Originally, we baked it into our closed-source Enterprise Search codebase. Relying mostly on Apache POI, Apache PDFBox, and core Java libraries, we set about finding the best way to convert original file binaries into a format we could render (usually HTML), before creating an appropriately sized and scaled BufferedImage to output. We first released thumbnails in 7.9.0, and have been battle-hardening them ever since.
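For a feel of what that last "sized and scaled BufferedImage" step can look like with nothing but core Java, here is a rough sketch. It is not a copy of our production code: the class name, the fit-inside-a-box policy, and the bilinear interpolation hint are all assumptions picked for illustration.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: scales an already-rendered image so it fits inside
// a maxWidth x maxHeight box while keeping its aspect ratio.
final class ThumbnailScaler {
    private ThumbnailScaler() {}

    static BufferedImage scaleToFit(BufferedImage source, int maxWidth, int maxHeight) {
        // Use the smaller of the two ratios so neither side overflows the box.
        // Note: a ratio above 1 means a small source image is scaled up.
        double ratio = Math.min(
            (double) maxWidth / source.getWidth(),
            (double) maxHeight / source.getHeight());
        int width = Math.max(1, (int) Math.round(source.getWidth() * ratio));
        int height = Math.max(1, (int) Math.round(source.getHeight() * ratio));

        BufferedImage scaled = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = scaled.createGraphics();
        try {
            graphics.setRenderingHint(
                RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Draw the source stretched (or shrunk) to the target size.
            graphics.drawImage(source, 0, 0, width, height, null);
        } finally {
            graphics.dispose();
        }
        return scaled;
    }
}
```

With this policy, an 800x400 render asked to fit 100x100 comes back as 100x50. The hard part is still the step before it: getting a faithful render of the document in the first place.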

And now, we reach the final stage of this story. Elastic has key cultural values that we call our Source Code, and one of those values is "Space, Time." Our website explains it as:

It’s easy to get stuck in a day-to-day work pattern. Allowing for the space and time to dream requires conscious effort. Embracing a high failure rate does, too.

Fulfillment comes from doing the obvious and dreaming up the un-obvious. Both are foundations of Elastic.

One of the ways this concretely manifests here is that engineers are encouraged to take up to 1 week every 3 months to just work on something that interests them. In my most recent "Space, Time" week, I pursued open-sourcing our thumbnails solution. By the end of that week, the open-sourcing project had momentum.

In true Elastic fashion, we've all pulled together to see the prototype from that "Space, Time" week come to fruition, and to commit to maintaining it. The result is Thumbnails4j (elastic/thumbnails4j on GitHub), a dirt-simple Java library for generating thumbnails. Have a Word document that you need a thumbnail for? It only takes a few lines of code:

import java.awt.image.BufferedImage;
import java.io.File;

import co.elastic.thumbnails4j.core.Dimensions;
import co.elastic.thumbnails4j.core.Thumbnailer;
import co.elastic.thumbnails4j.docx.DOCXThumbnailer;

File input = new File("/path/to/my_file.docx");
Thumbnailer thumbnailer = new DOCXThumbnailer();
Dimensions outputDimensions = new Dimensions(100, 100);
BufferedImage output = thumbnailer.getThumbnails(input, outputDimensions).get(0);
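Once you have that BufferedImage, what you do with it is plain core Java. As a small sketch (the class name and the choice of PNG are assumptions for illustration, and none of this is part of Thumbnails4j), here is one way to write a thumbnail to disk with javax.imageio:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: persists a thumbnail (any BufferedImage) as a PNG file.
final class ThumbnailWriter {
    private ThumbnailWriter() {}

    static File writePng(BufferedImage thumbnail, File target) throws IOException {
        // ImageIO.write returns false when no suitable writer is found,
        // so treat that case as a failure rather than silently continuing.
        if (!ImageIO.write(thumbnail, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
        return target;
    }
}
```

Calling `ThumbnailWriter.writePng(output, new File("/path/to/my_file.png"))` would then save the thumbnail next to the original document.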

We currently support doc, docx, pptx, xls, xlsx, pdf, jpg, jpeg, png, gif, and html. But it's open source (Apache 2.0 license), so if your favorite format isn't included, send us a change request! Together, let's make sure that this common problem has an open source solution that meets everyone's needs.

Charlie_Hull · December 2, 2021, 1:12pm

Neat! I helped build something similar in Python many moons ago that used headless OpenOffice.

Minor comment though - JODConverter uses OpenOffice / LibreOffice, which aren't "paid software" as you write above; they're free and open source. So I'm not sure why you discounted them.

Sean_Story · December 2, 2021, 3:18pm

Good catch! Thanks. I've edited it. My notes had said we'd discounted JODConverter due to the "Cost of OpenOffice/LibreOffice", but you're right that it's not a software cost, it's the hosting and maintenance cost of their service. They can't just be pulled in as libraries - JODConverter needs a network connection to a server.

Charlie_Hull · December 2, 2021, 3:49pm

Ah, I get it - I think when we experimented there was some way to connect to OpenOffice as a library, but perhaps things have changed since. Thanks for updating!

system · December 30, 2021, 3:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
