Search source code and also use Elasticsearch as the store for it at the same time

ice3man543 · February 14, 2020, 12:47pm

I am building a search engine for some internal code, and we decided to use Elasticsearch. The issue is that I also need to access those documents directly by ID while providing search on them.

Reading around, some forums suggested that I should never use it for this purpose: only search with Elasticsearch, and use something like HDFS or Couchbase to store the blobs. Any ideas?

There will be lots of small files, so I'm reluctant to store the same data twice.

Armin_Braun · February 14, 2020, 1:31pm

Hi @ice3man543,

If you're worried that individual source files will be too large as single documents, I think it is safe to say that you'll be fine. There is some discussion of the max document size in the "Maximum document size" topic (albeit pretty old, but the general concepts around Lucene still apply).
Since you're searching on these files anyway, I would assume you'll have to index them in full (if you want to offer full-text search on the source), so there is neither a need nor a point in storing them redundantly in another database, IMO.
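To make the "one index for both jobs" idea concrete, here is a minimal sketch of the three HTTP requests involved: indexing a file, fetching its original source by ID, and running a full-text search. It only builds the requests and does not send them. The cluster URL, the index name `source-code`, and the field names `path` and `content` are assumptions for illustration, not anything from this thread.

```python
import json
from urllib.parse import quote
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a local cluster
INDEX = "source-code"             # hypothetical index name


def index_request(doc_id: str, path: str, content: str) -> Request:
    """PUT /<index>/_doc/<id>: index one source file and keep its _source."""
    body = json.dumps({"path": path, "content": content}).encode("utf-8")
    return Request(
        f"{ES_URL}/{INDEX}/_doc/{quote(doc_id, safe='')}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def get_source_request(doc_id: str) -> Request:
    """GET /<index>/_source/<id>: return only the original document body."""
    return Request(
        f"{ES_URL}/{INDEX}/_source/{quote(doc_id, safe='')}",
        method="GET",
    )


def search_request(text: str, size: int = 10) -> Request:
    """POST /<index>/_search: full-text match on the (hypothetical) content field."""
    body = json.dumps({
        "query": {"match": {"content": text}},
        "size": size,
        # Return only the path per hit; the full file can be fetched by ID later.
        "_source": ["path"],
    }).encode("utf-8")
    return Request(
        f"{ES_URL}/{INDEX}/_search",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing any of these to `urllib.request.urlopen` would send it. Limiting `_source` in the search request is a design choice in this sketch: search hits stay small, and the full file is loaded only when someone asks for it by ID.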

ice3man543 · February 14, 2020, 1:34pm

Hi @Armin_Braun, my concerns aren't about document size; those documents will be quite small, under 5 MB. It's mostly about the number of documents there will be. If storing the files and retrieving the original source by document ID is doable, that's all I need. Thanks for the prompt reply! Anything to worry about when reading the source field for docs under 5 MB?

ice3man543 · February 14, 2020, 1:38pm

Also, @Armin_Braun, my document retrievals will be quite load-intensive, alongside providing search on the index. Do you think that's something I should be worried about? Will retrieving too many documents at the same time hurt search performance?

ice3man543 · February 14, 2020, 1:44pm

I had one more question regarding general use. In the microservices-based architecture I'm working in, there will be a number of workers that have to insert data into Elasticsearch. Should I use a central worker to collect data from all those nodes and push it to Elasticsearch, or should those workers index data into Elasticsearch in parallel? Which would be better?

Armin_Braun · February 14, 2020, 1:53pm

No problem @ice3man543!

Both of these issues (reading the source of documents under 5 MB, and heavy retrieval load alongside search) are a function of how much memory your nodes have available in addition to the configured JVM heap size. There is a bit of background on this here but what it boils down to is this:
The document source gets loaded from disk, and searches also need to load things from disk. Disk access is fast if you have enough RAM that the file system cache serves reads relatively often compared to physical disk reads.
=> The more RAM you have and the faster your disks, the less of an issue this is. As long as you size your nodes accordingly, this should be fine.

The number of documents is not an issue. The only limit to keep in mind is that you can only have ~2B (32-bit signed int max) documents per shard, because Lucene uses int document IDs. So keep this number in mind when deciding how many shards to use per index, and make the shard count large enough, but that's it.
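As a rough worked example of that limit, the sketch below computes the minimum number of primary shards needed to stay under the per-shard document cap. The `headroom` parameter is an assumption added here to leave room for growth; the 2^31 - 1 figure is the "32-bit signed int max" from above, treated as an upper bound.

```python
import math

# Upper bound on documents per shard: 32-bit signed int max.
INT32_MAX = 2**31 - 1  # 2,147,483,647


def min_primary_shards(expected_docs: int, headroom: float = 0.5) -> int:
    """Smallest shard count that keeps each shard under headroom * INT32_MAX docs.

    headroom is a hypothetical safety factor (0.5 = fill shards to at most half
    of the hard limit), assuming documents spread evenly across shards.
    """
    if expected_docs < 0:
        raise ValueError("expected_docs must be non-negative")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    per_shard = int(INT32_MAX * headroom)
    return max(1, math.ceil(expected_docs / per_shard))


print(min_primary_shards(10_000_000))     # 1
print(min_primary_shards(5_000_000_000))  # 5
```

This is only the floor imposed by the document-count limit. Other factors, such as how large each shard gets on disk, may well call for more shards than this number.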

Without more quantitative details, I would say you're most likely fine just having all those workers index independently. The important thing to look at is the number of documents you will be indexing in a single bulk request. Try to have the individual workers send bulk requests of multiple documents if possible, but unless we're talking about an extreme case of thousands of workers or so, this should not be an issue. In your case in particular, the bulk size should probably be on the smaller end because of the somewhat larger expected document size, so manually batching seems even less useful.
=> I think it's pretty unlikely that this will be an issue.
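A minimal sketch of what "each worker sends its own modest bulk requests" could look like: the helper below groups documents into newline-delimited bodies for the `_bulk` endpoint, capped by both document count and byte size. The function name, the default limits (100 documents, 5 MiB), and the index name in the usage lines are assumptions for illustration, not recommended values.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def bulk_bodies(
    docs: Iterable[Tuple[str, Dict]],
    index: str,
    max_docs: int = 100,               # hypothetical cap on docs per request
    max_bytes: int = 5 * 1024 * 1024,  # hypothetical cap on body size
) -> Iterator[bytes]:
    """Yield NDJSON bodies for POST /_bulk, one per batch.

    Each document becomes two lines: an action line and the source line.
    A batch is closed as soon as adding the next document would exceed
    either limit; a single oversized document still gets its own batch.
    """
    lines = []
    count = 0
    size = 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        entry = (action + "\n" + json.dumps(source) + "\n").encode("utf-8")
        if count and (count >= max_docs or size + len(entry) > max_bytes):
            yield b"".join(lines)
            lines, count, size = [], 0, 0
        lines.append(entry)
        count += 1
        size += len(entry)
    if lines:
        yield b"".join(lines)


# Usage: five small documents, at most two per bulk request.
files = [(str(i), {"content": "x" * 10}) for i in range(5)]
for body in bulk_bodies(files, "source-code", max_docs=2):
    print(len(body), "bytes")  # each body would be sent with Content-Type: application/x-ndjson
```

Capping by bytes as well as by count matters here because source files vary a lot in size, so a fixed document count alone could produce very uneven requests.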

ice3man543 · February 14, 2020, 4:20pm

Thanks for the help. Appreciated!

