Our nonprofit (Chipster.org) is creating offline educational collections for people who lack internet access. The collections will be distributed on flash drives and designed so that a user can freely copy their collection to someone else's device.
We've installed the full Elasticsearch suite and indexed 16GB of Web sites and other documents.
We will provide a static index with each collection so users can experience a contemporary search engine.
The question is: how much of this 1GB of folders and files can we dispense with if all the end user is going to do is basic search?
Our aim is to keep as much search functionality as possible while also saving space on the chips for more content.
Can someone who knows the innards of Elasticsearch advise us on how to optimize our search engine?
Elasticsearch verifies that the data on disk is complete and consistent, so never ever try to remove files or change anything at the file system level. Always use the APIs to manage your data.
I appreciate the advice on how to optimize the index. I suspect this will be an ongoing endeavor as we test the search engine in the field.
My inquiry, though, is even weirder.
What can we do to shrink the footprint of the Elasticsearch installation? I'm guessing that, as with other software I've worked with, there are files in the Elasticsearch folder that are not necessary for our purposes.
Our plan is this:
-- we index (and optimize) the content collection on a machine with a full Elasticsearch installation, unlimited storage, and tons of RAM.
-- after the index has been created, it will not be updated or changed
-- we then distribute the new static index, a search UI, and the necessary Elasticsearch code with our content.
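For what it's worth, the "optimize, then freeze" part of a plan like this can be done through the REST API rather than by touching files. Below is a minimal Python sketch, not a tested recipe: it assumes a node listening on `http://localhost:9200` and an index called `collection` (both are placeholders I made up). It force-merges the index down to a single segment and then sets a write block so the index stays static.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port
INDEX = "collection"              # hypothetical index name


def build_requests(es_url=ES_URL, index=INDEX):
    """Build the two REST calls that finalize an index that will never change."""
    # Merge the index down to one segment; this also expunges deleted documents.
    force_merge = urllib.request.Request(
        f"{es_url}/{index}/_forcemerge?max_num_segments=1",
        method="POST",
    )
    # Block further writes so the distributed index stays static.
    body = json.dumps({"index.blocks.write": True}).encode("utf-8")
    block_writes = urllib.request.Request(
        f"{es_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return [force_merge, block_writes]


def finalize_index(es_url=ES_URL, index=INDEX):
    """Send the requests in order. Requires a running Elasticsearch node."""
    for req in build_requests(es_url, index):
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Calling `finalize_index()` on the indexing machine, after the last document has been added, would run both steps. If security is enabled on the node, the requests would also need credentials, which this sketch leaves out.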
Currently, our Elasticsearch folder is over 900MB. Of that, the index is 105MB.
So we have roughly 800 MB of code.
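To see where the non-index space actually goes, a breakdown by top-level folder helps. Here is a small Python sketch that only reads sizes and changes nothing on disk; the installation path you pass in is whatever your own Elasticsearch folder is.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def size_breakdown(install_dir):
    """Map each top-level entry of install_dir to its size, largest first."""
    sizes = {}
    for entry in os.scandir(install_dir):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))
```

For example, `for name, size in size_breakdown("/path/to/elasticsearch").items(): print(name, size // 2**20, "MiB")` prints one line per folder, which makes it easier to discuss which parts dominate.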
Which brings me back to my original conundrum: are there Elasticsearch code files that we can dispense with after indexing if we don't intend to add to or alter the index?
I agree, I don't think you can safely just delete stuff from the installation dir. It might work to modify the build to exclude the parts you don't need instead, but IMO this would be pretty tricky to get right and you would struggle to get help with such a nonstandard distribution. For instance, you wouldn't be able to test it with the standard test suite because this will fail if features are missing.
I'm not a lawyer, and certainly not your lawyer, but I also worry that you may run into licensing and/or trademark issues with such a modified distribution. If so, I'm sure this is surmountable, but it's a question of whether it's the best use of your resources given that 512MiB flash drives are not all that much cheaper than 1GiB ones.
Could you use a compressed filesystem instead? Most of the installation compresses pretty well, so I think you might fit everything into 512MiB that way.
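Before committing to a compressed filesystem, you could get a rough idea of the savings with a few lines of Python. This is only an estimate under stated assumptions: it compresses each file in independent 128 KiB chunks with zlib, which loosely imitates block-wise compression but is not what any particular filesystem will produce exactly. The chunk size and compression level are my own choices.

```python
import os
import zlib


def estimate_compressed_size(path, chunk_size=128 * 1024, level=6):
    """Return (raw_bytes, compressed_bytes) for all regular files under path.

    Each file is compressed chunk by chunk, so the result approximates
    block-wise compression rather than one big archive.
    """
    raw = 0
    packed = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue  # skip symlinks so nothing is counted twice
            with open(file_path, "rb") as fh:
                while chunk := fh.read(chunk_size):
                    raw += len(chunk)
                    packed += len(zlib.compress(chunk, level))
    return raw, packed
```

Running `raw, packed = estimate_compressed_size("/path/to/elasticsearch")` and comparing the two numbers gives a ballpark figure for whether everything would fit on a smaller drive. The real result depends on the filesystem and compressor you pick.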