ElasticSearch and parallelization

Dave_O · February 11, 2013, 12:39pm

I have terabytes of data that I need to index. In my environment I have
lots and lots of data but minimal concurrent queries. I'm curious how
parallelization works with ES.

This is a hypothetical scenario. Lets say I have US state data that
logically can be broken into 52 parts. For arguments sake lets say that
the data counts are equivilent in all the states. Lets also assume I have
52 Nodes/Servers

The 2 scenarios I'm looking at are

Creating 1 large index with many shards. Lets say 52 in this case. (and
putting on separate servers)
Or I could create 52 separate indexes.

The questions I have.

When searching on scenario 1 (52 shards) and my query spands all 52
shards will this run in parallel. (ie, will I get 52 parallel threads)
Similarly, If I query all 52 indexes with the same query such as
index1,index2,index3...index52 will this kick off 52 parallel processes.

Is there a way to control the parallelization in either scenario?

Any opinions on which option be faster or a better configuration for both
searches and indexing?

In either case I could direct single state queries to the appropriate
Shard or Index by either using routing or just going directly to the
appropriately index.

But routinely I will have queries that span some or all of the states.
From everything I can tell the queries will be parallelized...but
wondering to what extent.

Would the scenario be the same if I had all 52 Nodes on the same server.
(Ie, would still get 52 parallel processes in both cases?)

Thanks for your guidance.
.http://elasticsearch-users.115913.n3.nabble.com/Question-about-parallelization-td4029620.html#

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clinton_Gormley · February 11, 2013, 12:54pm

Hi Mike

This is a hypothetical scenario. Lets say I have US state data that
logically can be broken into 52 parts. For arguments sake lets say
that the data counts are equivilent in all the states. Lets also
assume I have 52 Nodes/Servers

The 2 scenarios I'm looking at are

Creating 1 large index with many shards. Lets say 52 in this case.
(and putting on separate servers)

Or I could create 52 separate indexes.

The questions I have.

When searching on scenario 1 (52 shards) and my query spands all 52
shards will this run in parallel. (ie, will I get 52 parallel
threads)

Similarly, If I query all 52 indexes with the same query such as
index1,index2,index3...index52 will this kick off 52 parallel
processes.

Searching 52 indices with 1 shard each is exactly equivalent to
searching one index with 52 shards. The search happens at shard level.

All shards would be searched in parallel.

Is there a way to control the parallelization in either scenario?

As in to limit the number of threads? You could look at configuring the
thread pool:

but I'm not sure if that would apply to the number of searches per
shard, or for distributed search as you describe above.

Any opinions on which option be faster or a better configuration for
both searches and indexing?

Exactly equivalent.

In either case I could direct single state queries to the appropriate
Shard or Index by either using routing or just going directly to the
appropriately index.

Correct

But routinely I will have queries that span some or all of the states.
From everything I can tell the queries will be parallelized...but
wondering to what extent.

The node that handles the request will "scatter" the request to all
relevant shards in parallel, gather their results and return the
collated results.

Would the scenario be the same if I had all 52 Nodes on the same
server. (Ie, would still get 52 parallel processes in both cases?)

This I'm not sure about. I think this is where the threadpool config
would kick in to limit the number of concurrent searches.

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
Aggregations Parallelism Level Elasticsearch	1	352	July 6, 2017
How does Elasticsearch process a query? Elasticsearch	7	5555	August 17, 2019
Multiple data directories ->parallel search of shards on same instance? Elasticsearch	6	3400	July 5, 2017
Parallelism Per Request Elasticsearch	2	676	August 30, 2018
Search multiple indices Elasticsearch	4	381	January 2, 2022

ElasticSearch and parallelization

Related topics