Hey all,
I'm trying to come up with a system for using ES to manage a large number of
indices (tens of thousands to hundreds of thousands), and am turning to
the mailing list to see if anybody has any ideas or suggestions.
The properties are:
- The data source is bursty, rather than a steady stream.
- Because I don't want new data to affect old results, I don't think I can
really have a single huge index sliced/faceted by some key. That is, a
single huge index would mean that all the data shares one set of TF/IDF
statistics, so scores for one data set would be skewed by all the others.
- I expect that only a few indices will need to be open and ready for
search at any given time. Older indices would probably never even get
opened, but we need to keep them around just in case somebody wants them.
The best I've come up with so far is a system where I keep all indices
closed initially, open an index for its bursty write, do a refresh and an
optimize (to make sure the new data is searchable), and then close it
again. In this system, it is the responsibility of the client to open the
appropriate index whenever a search request comes in.
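To make the cycle concrete, here is a minimal sketch of the request sequence, written as plain Python that only builds a list of (method, path) pairs and sends nothing. The helper names and the index name are made up for illustration, and the paths assume the REST endpoints of an ES version that still has _optimize (newer releases call it _forcemerge).

```python
from typing import List, Tuple

# (HTTP method, path) pair; nothing here talks to a real cluster.
Request = Tuple[str, str]


def burst_write_plan(index: str) -> List[Request]:
    """Hypothetical helper: requests for one bursty write to one index."""
    return [
        ("POST", f"/{index}/_open"),      # index starts out closed
        ("POST", f"/{index}/_bulk"),      # the burst of new documents
        ("POST", f"/{index}/_refresh"),   # make the new data searchable
        ("POST", f"/{index}/_optimize"),  # merge segments (assumed old API name)
        ("POST", f"/{index}/_close"),     # release resources until next use
    ]


def search_plan(index: str) -> List[Request]:
    """Hypothetical helper: the client opens the index before searching it."""
    return [
        ("POST", f"/{index}/_open"),
        ("GET", f"/{index}/_search"),
    ]


if __name__ == "__main__":
    # "burst-000042" is an invented index name.
    for method, path in burst_write_plan("burst-000042"):
        print(method, path)
```

Every write and every search on a cold index pays for an open first, which is where I would expect a good part of the latency to come from.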
In test setups, this seems to work ok, but performance (both on the search
and the index side) is a bit lacking (multi-second searches and indexing).
I'm starting to make my way through the codebase to see what parts of ES
internally need to be tweaked to support this kind of scenario, but it
would be really useful if somebody could name a class or two that I could
focus on first.
In particular, a search sometimes hangs (turning what is normally a
subsecond search into tens to hundreds of seconds) while some indexing is
going on, which doesn't quite make sense to me. Are there global data
structures involved in opening, closing, refreshing, and optimizing
indices that could be tuned to perform better for this kind of usage?
Also, are there any other gotchas I should consider for the future? One
thing I can think of is that having hundreds of thousands of directories
(indices) might not be the best thing from a filesystem perspective, but
I'm really not sure. Is there something obvious I'm not considering?
If anybody has any opinions, it'd be greatly appreciated.
Cheers!
Matt