My problem is to search data of thousands of users, e.g. mailboxes. Almost all the time search is filtered by user id. How this locality of searches could be taken into consideration? I'm trying to achieve performance comparable to a case where each user has dedicated index.
Sharding is not an option because it will be used (total number of users ~ 1M), and I'm looking for a solution to use inside a shard of ~4k users.
In order to ensure a minimal number of shards need to be queried, you can use routing when indexing and searching. This can help improve performance and increase the number of queries the cluster can support.
What does your indexing and sharing strategy look like?
Thanks for your reply! I shard by user id, and route search to user's shard. On datacenter level everything looks fine, I suppose.
Thanks for replying!
Yes, cached filters is my plan B. But as far as I understand it still incurs serious performance penalty compared to index-per-user case. Lets consider search for one term with filter user_id: 5. Elasticsearch would have to intersect inverted index term block and cached filter. Term block (FOR) could contain significant part of documents in the collection. In case of 1000 users cached filter (roaring bitmap) on average contains 1/1000 of documents in the collection. So, on average we'll have to skip to every 1000th document in FOR. It's faster than enumeration of all document in FOR, but still slower than enumeration of hypothetical dedicated FOR of documents which belongs to user 5.
My thoughts are based on https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps.
How could we speed up this intersection? We could somehow group assignment of lucene internal document ids by user. So documents of user 1 is 1 .. 10000, documents of user 2 is 10001 .. 13120, documents of user 3 is 13121 .. 25634 and so on. In this case iteration in FOR would skip large amount of documents before user_N's documents, iterate over filtered documents, and skip the rest. In theory now we're having iteration over kinda-dedicated-to-user-N-FOR and one or two major FOR skips, which is hopefully quite effective in terms of CPU and IOPS. As for way of grouping documents in FOR by user, afaik lucene allows me to add documents in batch and have monotonic document ids (via IndexWriter.addDocuments method), and it'll survive segments merging.
Of course, some degree of bloating is fine, for example because of some new document additions.
This leads me to a question - how could I ensure that indexing of user documents in Elasticsearch will monotonically add them to lucene index so they would have monotonic internal ids and as a result will have good locality in FOR? Does correct order of index API requests suffice?
Your analysis is correct. Index requests are currently mapped onto
indexWriter.addDocuments calls (see here). Ordering them by user will thus give good locality in FOR. If you want stronger guarantees in presence of new document additions, you would currently need to implement a custom SortingMergePolicy for ES. With index sorting having just become a first-class citizen in Lucene 6.2 we are working towards exposing this functionality in a configurable way in ES.
Great, sounds like a solution, thanks!