I am building a search engine for some internal code, and we decided to use Elasticsearch. The issue is I also need access to those documents directly by ID while providing search on them.
Reading around, some forums suggested that I should never use it for this purpose: only search with Elasticsearch, and use something like HDFS or Couchbase for storing those blobs. Any ideas?
There will be lots of small files, and therefore I'm reluctant to store the same data twice.
If you're worried that individual source files will be too large as single documents, I think it is safe to say that you'll be fine. There is some discussion around this in the topic Maximum document size (albeit pretty old, but the general concepts around Lucene still apply).
Since you're searching on these files anyway, you'll have to index them in full, I would assume (if you want to offer full-text search on the source), so there is neither a need nor a point in storing them redundantly in another database, IMO.
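For illustration, here is a minimal sketch with the Python client (elasticsearch-py 8.x is assumed; on 7.x the `document=` argument would be `body=`). The index name `source-code` and the file path are made up:

```python
# Minimal sketch with the official Python client (elasticsearch-py 8.x assumed);
# the index name "source-code" and the file path are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

path = "src/app/main.py"
with open(path) as f:
    content = f.read()

# Index the file once, in full; the raw text lives in the stored _source,
# so the same document serves both full-text search and direct retrieval.
es.index(index="source-code", id=path, document={"path": path, "content": content})

# Later, fetch the original source directly by ID -- no second datastore needed.
doc = es.get(index="source-code", id=path)
print(doc["_source"]["content"])
```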
Hi @Armin_Braun, my concerns aren't about document size; those documents will be quite small, under 5 MB. It's mostly about the number of documents that will be there. If storing the files and retrieving the original source is doable with the ID of the document, that's all I need. Thanks for the prompt reply! Anything to worry about when reading the source field for documents under 5 MB?
Also, @Armin_Braun, my document retrievals will be quite load-intensive, alongside the search I'll be providing on the index. Do you think that's something I should be worried about? Will retrieving too many documents at the same time put a cost on search performance?
I had one more query, regarding general use. There will be a number of workers in the microservices-based architecture I'm working in which have to insert data into Elasticsearch. Should I use a central worker to collect data from all those nodes and push it to Elasticsearch, or should those workers index data to Elasticsearch in parallel? What would be better?
These two issues are a function of how much memory your nodes have available in addition to the configured JVM heap size. There is a bit of background on this here but what it boils down to is this:
The document source gets loaded from disk. Disk access is fast if you have enough RAM that the file system cache serves most reads instead of physical disk reads, and the same goes for searches that need to load things from disk.
=> The more RAM you have and the faster your disks, the less of an issue this is, so as long as you size your nodes accordingly this should be fine.
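If retrieval by ID turns out to be load-heavy, batching lookups into multi-get requests keeps per-request overhead down. A sketch, again assuming elasticsearch-py 8.x and the made-up `source-code` index and IDs:

```python
# Sketch of batched retrieval by ID via the multi-get API; the index name and
# document IDs are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.mget(index="source-code", ids=["src/app/main.py", "src/app/util.py"])
for doc in resp["docs"]:
    if doc.get("found"):
        print(doc["_id"], "->", len(doc["_source"]["content"]), "chars of source")
```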
The number of documents is not an issue. The only limit you have to keep in mind here is that you can only have ~2 billion (signed 32-bit int max) documents per shard, because Lucene uses int document IDs. So you have to keep this number in mind when deciding how many shards to use per index and make that number large enough, but that's it.
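Since the primary shard count is fixed at index creation (changing it later means splitting or reindexing), it's worth setting it up front against your expected volume. A sketch; the shard and replica counts below are placeholders, not recommendations:

```python
# Sketch: create the index with enough primary shards to stay well below the
# ~2 billion documents-per-shard Lucene limit; shard/replica counts here are
# placeholder assumptions to size against your own expected document count.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="source-code",
    settings={"number_of_shards": 5, "number_of_replicas": 1},
)
```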
Without more quantitative details here, I would say you're most likely good to just have all those workers work independently. The important thing to look at is the number of documents you put into a single bulk request. Try to make the individual workers send bulk requests of multiple documents if possible (a sketch follows below), but unless we're talking about an extreme case of thousands of workers or so, this should not be an issue. In your case in particular, the bulk size should probably be chosen somewhat on the smaller end because of the expected slightly larger documents, so funnelling everything through a central batching worker seems even less useful in your case.
=> it's pretty unlikely that this will be an issue I think
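As a rough sketch of what an individual worker could do, using the bulk helper from elasticsearch-py (the `pending_files()` generator, the index name, and the chunk size are assumptions for illustration):

```python
# Sketch of one worker indexing its own small batches via the bulk helper;
# index name, chunk size, and pending_files() are illustrative assumptions.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def pending_files():
    # Hypothetical stand-in for however a worker discovers files to index.
    yield "src/app/main.py", "print('hello')"

def actions():
    for path, content in pending_files():
        yield {
            "_index": "source-code",
            "_id": path,
            "_source": {"path": path, "content": content},
        }

# Keep chunks on the smaller side since individual documents can be a few MB.
helpers.bulk(es, actions(), chunk_size=50)
```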