I have terabytes of data that I need to index. In my environment I have
lots and lots of data but minimal concurrent queries. I'm curious how
parallelization works with ES.
This is a hypothetical scenario. Let's say I have US state data that
logically can be broken into 52 parts. For argument's sake, let's say that
the data counts are equivalent in all the states. Let's also assume I have
52 nodes/servers.
The two scenarios I'm looking at are:
Creating one large index with many shards, let's say 52 in this case (and
putting them on separate servers).
Or I could create 52 separate indexes.
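As a rough sketch of the two setups — assuming a cluster reachable at localhost:9200, and with made-up index names (exact request syntax varies by Elasticsearch version) — they might look like this:

```shell
# Scenario 1: one index with 52 primary shards
# (replicas set to 0 here only to keep the example simple).
curl -X PUT "localhost:9200/states" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 52, "number_of_replicas": 0 }
}'

# Scenario 2: 52 single-shard indexes, one per state.
for state in alabama alaska arizona; do   # ...and so on for each state
  curl -X PUT "localhost:9200/state-$state" -H 'Content-Type: application/json' -d'
  {
    "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
  }'
done
```

Either way the cluster ends up with 52 primary shards; the difference is whether they live under one index name or 52.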
The questions I have:
When searching in scenario 1 (52 shards), if my query spans all 52
shards, will this run in parallel (i.e., will I get 52 parallel threads)?
Similarly, if I query all 52 indexes with the same query, such as
index1,index2,index3...index52, will this kick off 52 parallel processes?
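For reference, a multi-index search can name the indexes explicitly or use a wildcard; both forms below are sketches with hypothetical index names and a hypothetical field, assuming a cluster at localhost:9200:

```shell
# Explicit comma-separated list (abbreviated here; the real call
# would list all 52 index names).
curl "localhost:9200/state-alabama,state-alaska,state-arizona/_search?q=city:springfield"

# A wildcard covers all matching indexes without listing each name.
curl "localhost:9200/state-*/_search?q=city:springfield"
```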
Is there a way to control the parallelization in either scenario?
Any opinions on which option would be faster or a better configuration for
both searching and indexing?
In either case I could direct single-state queries to the appropriate
shard or index, either by using routing or by going directly to the
appropriate index.
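The routing approach for scenario 1 might look like the following sketch. The index names, field values, and document path are hypothetical, and the exact URL shape (e.g. the `_doc` segment) depends on the Elasticsearch version:

```shell
# Scenario 1: supply a routing value at index time so all of a state's
# documents land on one shard...
curl -X PUT "localhost:9200/states/_doc/1?routing=texas" \
  -H 'Content-Type: application/json' -d'
{ "state": "texas", "city": "austin" }'

# ...then supply the same routing value at search time so only that
# one shard is queried.
curl "localhost:9200/states/_search?routing=texas&q=city:austin"

# Scenario 2: no routing needed; just address that state's index directly.
curl "localhost:9200/state-texas/_search?q=city:austin"
```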
But routinely I will have queries that span some or all of the states.
From everything I can tell the queries will be parallelized, but I'm
wondering to what extent.
Would the scenario be the same if I had all 52 nodes on the same server?
(i.e., would I still get 52 parallel processes in both cases?)
This is a hypothetical scenario. Let's say I have US state data that
logically can be broken into 52 parts. For argument's sake, let's say
that the data counts are equivalent in all the states. Let's also
assume I have 52 nodes/servers.
The two scenarios I'm looking at are:
Creating one large index with many shards, let's say 52 in this case
(and putting them on separate servers).
Or I could create 52 separate indexes.
The questions I have:
When searching in scenario 1 (52 shards), if my query spans all 52
shards, will this run in parallel (i.e., will I get 52 parallel
threads)?
Similarly, if I query all 52 indexes with the same query, such as
index1,index2,index3...index52, will this kick off 52 parallel
processes?
Searching 52 indices with 1 shard each is exactly equivalent to
searching one index with 52 shards. The search happens at the shard
level. All shards would be searched in parallel.
Is there a way to control the parallelization in either scenario?
As in, to limit the number of threads? You could look at configuring the
search thread pool, but I'm not sure whether that would apply to the
number of searches per shard or to a distributed search as you describe
above.
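For illustration, the search thread pool is configured in elasticsearch.yml along these lines. The values below are made up, and the exact setting names have changed across Elasticsearch versions (older releases used a `threadpool.` prefix rather than `thread_pool.`):

```yaml
# elasticsearch.yml -- illustrative values only.
# Caps the threads a node uses to execute search requests,
# and how many requests may queue when all threads are busy.
thread_pool.search.size: 13
thread_pool.search.queue_size: 1000
```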
Any opinions on which option would be faster or a better configuration
for both searching and indexing?
Exactly equivalent.
In either case I could direct single-state queries to the appropriate
shard or index, either by using routing or by going directly to the
appropriate index.
Correct.
But routinely I will have queries that span some or all of the states.
From everything I can tell the queries will be parallelized, but I'm
wondering to what extent.
The node that handles the request will "scatter" the request to all
relevant shards in parallel, gather their results and return the
collated results.
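You can see this scatter/gather fan-out reflected in the `_shards` section of any search response, which reports how many shards the coordinating node queried. A sketch, with a hypothetical index and field, assuming a cluster at localhost:9200 (the response shown is abbreviated and illustrative, not real output):

```shell
# size=0 skips fetching documents; we only care about the shard counts.
curl "localhost:9200/states/_search?size=0&q=city:austin"
# Illustrative shape of the response:
# {
#   "_shards": { "total": 52, "successful": 52, "failed": 0 },
#   ...
# }
```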
Would the scenario be the same if I had all 52 nodes on the same
server? (i.e., would I still get 52 parallel processes in both cases?)
This I'm not sure about. I think this is where the thread pool config
would kick in to limit the number of concurrent searches.