We want to clarify the optimal index design for our special application requirements.
What I've understood about sharding is that it's like hash-partitioning a database table (e.g. in Oracle).
The effect would be that queries can be parallelized over the cluster nodes. Every single query result
will then be collected and delivered to the client in a consolidated manner. So far so good.
We use a special index for simple identifier lookups. The identifier values (bold faced in the mapping below) are unique and will be used to search for the documentId.
Dividing this index into, say, n primary shards would obviously lead to (n-1) superfluous queries.
So our thought is to divide that index logically into one index per identifier. Each separate (identifier) index
should reside in exactly one primary shard, with as many replica shards as there are cluster nodes available.
We expect a good load balancing by distributing the client queries over the cluster nodes.
Would you agree with these considerations, or are there any drawbacks?
The default search over an index will be distributed over the shards of an index, not the cluster nodes. The hash of the document ID determines the shard ID.
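As a rough illustration of "the hash of the document ID determines the shard ID", here is a minimal sketch. The function name shard_for is hypothetical, and crc32 is only a stand-in: Elasticsearch uses its own internal hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib


def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document _id) to a shard number."""
    # Stand-in hash for illustration only; not Elasticsearch's actual hash.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % num_primary_shards


if __name__ == "__main__":
    # The same ID always maps to the same shard for a fixed shard count.
    for doc_id in ["310017001", "310017002", "310017003"]:
        print(doc_id, "-> shard", shard_for(doc_id, 5))
```

The point of the sketch is only that the mapping is deterministic and depends on the number of primary shards, which is why that number cannot simply be changed on an existing index.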
I do not understand. Elasticsearch can filter and route by given query terms and prevents execution of sub-queries with obviously empty results.
For example, if you use the Elasticsearch document ID (the _id field) with the value of the documentID (assuming documentID is unique) and search for the _id only, Elasticsearch will project your query to exactly that shard where the document is located.
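The difference between a lookup by _id and a search on some other field can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class ShardedIndex, its methods, and the crc32 stand-in hash are hypothetical and are not Elasticsearch APIs. The example values are the ones mentioned in this thread.

```python
import zlib
from typing import Dict, List, Optional, Tuple


class ShardedIndex:
    """Toy model of an index split into several primary shards."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[str, dict]] = [{} for _ in range(num_shards)]

    def _shard_no(self, doc_id: str) -> int:
        # Stand-in hash; a real cluster uses its own hash function.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def index(self, doc_id: str, source: dict) -> None:
        self.shards[self._shard_no(doc_id)][doc_id] = source

    def get(self, doc_id: str) -> Tuple[Optional[dict], int]:
        """Lookup by _id: returns (source, number of shards touched)."""
        return self.shards[self._shard_no(doc_id)].get(doc_id), 1

    def search_field(self, field: str, value: str) -> Tuple[List[dict], int]:
        """Search on another field: every shard has to be asked."""
        hits: List[dict] = []
        for shard in self.shards:
            hits.extend(s for s in shard.values() if s.get(field) == value)
        return hits, len(self.shards)


if __name__ == "__main__":
    idx = ShardedIndex(5)
    idx.index("310017001", {"iswc": "T8000024045"})
    print(idx.get("310017001"))                      # touches one shard
    print(idx.search_field("iswc", "T8000024045"))   # touches all five shards
```

In this model the get touches one shard because the shard can be computed from the ID, while the field search has no such shortcut and visits every shard even though only one holds the document.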
That looks like you intend to index documents which have a relation to the same documentID, so documentID seems not unique.
Have you checked the parent/child model? The pros/cons and improved performance depend on what kind of queries you want to execute.
One index per identifier is impossible if you have more than, say, a few hundred identifiers, because an index consumes quite a lot of resources: each shard is a complete Lucene index.
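To put a number on that, the count of Lucene indices in the cluster is simply the number of indices times the shard copies per index. A small sketch, where the function name total_shards and the example figures are hypothetical and chosen only for illustration:

```python
def total_shards(num_indices: int, primaries_per_index: int,
                 replicas_per_primary: int) -> int:
    """Total shard copies in the cluster; each one is a full Lucene index."""
    return num_indices * primaries_per_index * (1 + replicas_per_primary)


if __name__ == "__main__":
    # Hypothetical figures: 300 indices, 1 primary shard each, 2 replicas each.
    print(total_shards(300, 1, 2))  # 300 * 1 * (1 + 2) = 900
```

Raising the replica count to match the number of nodes multiplies this figure accordingly.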
Thank you for replying. Our documentId field is not the internal _id field of the stored document, but rather an external ID of a musical work. You say that search queries are distributed over shards. Yes, but the effect is the same, because the (replica) shards are distributed over all cluster nodes, and so are the queries.
We search with neither the documentId nor the _id field but with, e.g., "iswc:T8000024045", and that query should normally return exactly one match, say with documentId="310017001". As this unique (Lucene) document can reside in only one primary shard, I strongly assume that the queries for all shards but one will not return any result.
We expect to have fewer than 20 such secondary identifiers (iswc, isrc, ...). Do you think that this will have a big impact on internal resources? Does every index need its own Lucene instance?
On the other hand we expect to have nearly 30 million isrc values and about 200 million dspwc values and we expect some benefit from separate searches.