I've launched an Elasticsearch server on Linux. I've put some data on the
server (19 indices, 176 messages per index; the work directory is
about 39 MB). While attempting to put some new data (into a 20th index)
the server stops responding - curl returns "Empty reply from server",
and no related log is written (even at DEBUG level). More precisely,
the only logs written relate to creating the new index:
[15:37:38,484][INFO ][cluster.metadata ] [Spider-Girl]
Creating Index [nuser101], shards [5]/[1]
[15:37:38,484][DEBUG][cluster.service ] [Spider-Girl] Cluster
state updated, version [145], source [create-index [nuser101]]
[15:37:38,484][DEBUG][indices.cluster ] [Spider-Girl] Index
[nuser101]: Creating
[15:37:38,484][DEBUG][indices ] [Spider-Girl]
Creating Index [nuser101], shards [5]/[1]
[15:37:38,597][DEBUG][index.mapper ] [Spider-Girl]
[nuser101] Using dynamic [true] with location [null] and source [{ default : {
}
}]
[15:37:38,597][DEBUG][index.cache.filter.soft ] [Spider-Girl]
[nuser101] Using weak filter cache with readerCleanerSchedule [1m]
[15:37:38,597][DEBUG][gateway ] [Spider-Girl] Writing
to gateway
On the next attempt no logs are written at all, because the index already
exists.
The server still responds to GET requests.
I have no idea where the problem is (it's the same on other machines).
I'm begging for help.
Content of elasticsearch.in.sh (excerpt):
CLASSPATH=$CLASSPATH:$ES_HOME/lib/*
# Arguments to pass to the JVM
java.net.preferIPv4Stack=true: Better OOTB experience, especially
There is a hard limit on the number of shards each node can have (basically,
the number of shards that can be allocated to each node), and it's set to 100. I
think that you ran into this limit :). Are you creating an index per user?
In this case, it might make sense to configure it with fewer shards; even 1
(with 1 replica) would be good. There are cluster status and indices
status APIs which return the allocation status of all the different
shards.
This is not currently exposed to be set in the elasticsearch.yml
configuration (you can open an issue and I will fix this, it should be
exposed!), but for now, you can simply head over to the metadata file (the
last one) created by the gateway service after you shut down all the nodes,
and change it there (it's a simple JSON file with
max_number_of_shards_per_node set in it).
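(For illustration, a hedged sketch of that manual edit. Only the
max_number_of_shards_per_node key comes from the post above; the gateway
path and file naming are assumptions about the default local gateway
layout, so verify them against your own work directory:

    # Shut down all nodes first, as described above, then locate the
    # metadata files written by the gateway (path is an assumption):
    cd "$ES_HOME/work/elasticsearch/metadata"

    # Pick the last metadata file the gateway wrote (highest sequence number).
    latest=$(ls -1 metadata-* | sort -t- -k2 -n | tail -1)

    # It is a plain JSON file; raise the limit with any editor, or e.g.:
    sed -i 's/"max_number_of_shards_per_node"[^,}]*/"max_number_of_shards_per_node": 1000/' "$latest"

Restart the nodes afterwards so the gateway recovers the edited metadata.)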
Is ElasticSearch prepared to handle (i.e., work properly and efficiently with)
this limit raised to about 10,000? Only some of the indices would be
used intensively, but I need a per-node shard limit much bigger than
100.
If you have a lot of indices (10k), then searching across all of them does
not make sense (it will take forever). You can certainly search each time
across N indices (where N varies depending on the query and the
number of shards each index has, but not in the 100s).
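(As a concrete illustration of that request shape, not taken from the
thread: several indices can be listed comma-separated in the search URL.
The index names and field below are hypothetical.

    # Searching across N indices in one request:
    curl -XGET 'http://localhost:9200/nuser1,nuser2,nuser3/_search' -d '{
      "query": { "term": { "body": "hello" } }
    }')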
In terms of elasticsearch itself (I discussed this on a different thread as
well), the limit is basically the amount of memory on the node. The cluster
state, which is basically the in-memory representation of the fact that
an index exists, the list of shards each index has, and which nodes they are
assigned to (and a bit more), basically the metadata, is kept in memory. I
have not tried just creating N indices and taking a memory measurement
before and after to get a feeling for how much memory N indices take.
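(A rough sketch of that experiment, for illustration: snapshot the JVM
heap, create N single-shard indices, snapshot again. The stats endpoint
and settings spelling shown follow later releases; 2010-era nodes exposed
them differently, so treat this as the shape of the test, not exact syntax.

    # Heap before:
    curl -s 'http://localhost:9200/_nodes/stats/jvm' | grep heap_used

    # Create N one-shard indices (index names are hypothetical):
    for i in $(seq 1 1000); do
      curl -s -XPUT "http://localhost:9200/user$i" \
        -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}' > /dev/null
    done

    # Heap after:
    curl -s 'http://localhost:9200/_nodes/stats/jvm' | grep heap_used)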
What is the main reason that you want to have an index per user?
cheers,
shay.banon
On Mon, Apr 12, 2010 at 6:00 PM, Łukasz Osipiuk lukasz@osipiuk.net wrote:
> If you have a lot of indices (10k), then searching across all of them does
> not make sense (it will take forever).
Every query we send to ES uses only one index. We are not planning
to execute queries that need to touch more than one, as we index users'
private data to which other users have no access.
> You can certainly search each time
> across N indices (where N varies depending on the query and the
> number of shards each index has, but not in the 100s).
> In terms of elasticsearch itself (I discussed this on a different thread as
> well), the limit is basically the amount of memory on the node. The cluster
> state, which is basically the in-memory representation of the fact that
> an index exists, the list of shards each index has, and which nodes they are
> assigned to (and a bit more), basically the metadata, is kept in memory.
And what about open index files? Does ES have some mechanism to limit
their number (e.g. some LRU mechanism which closes index files that
are not expected to be accessed in the near future)?
And is the number of threads accessing index files limited? Or, with lots
of concurrent queries touching different indices, will we get degraded
performance due to highly concurrent disk access?
Sorry if these are trivial questions whose answers are available in the docs.
> I have not tried just creating N indices and taking a memory measurement
> before and after to get a feeling for how much memory N indices take.
> What is the main reason that you want to have an index per user?
The main reason is the architecture of the application we would like to use ES in.
We have users clustered into groups. Generally a user sticks to one
group of servers, but if the data of the users in one group grows too big,
we have to move some of them to a new cluster. And we would like to be
able to move ES data easily. I guess that with an index-per-user approach,
all we have to do is copy the user's index files to another ES instance.
>> If you have a lot of indices (10k), then searching across all of them does
>> not make sense (it will take forever).
> Every query we send to ES uses only one index. We are not planning
> to execute queries that need to touch more than one, as we index users'
> private data to which other users have no access.
Sounds good. Just note what I explained in the previous email regarding the
cluster state and in-memory storage of the metadata for each index. I would
first test that 10k indices can be handled by the amount of memory you plan
to assign to each node.
>> You can certainly search each time
>> across N indices (where N varies depending on the query and the
>> number of shards each index has, but not in the 100s).
>> In terms of elasticsearch itself (I discussed this on a different thread as
>> well), the limit is basically the amount of memory on the node. The cluster
>> state, which is basically the in-memory representation of the fact that
>> an index exists, the list of shards each index has, and which nodes they are
>> assigned to (and a bit more), basically the metadata, is kept in memory.
> And what about open index files? Does ES have some mechanism to limit
> their number (e.g. some LRU mechanism which closes index files that
> are not expected to be accessed in the near future)?
> And is the number of threads accessing index files limited? Or, with lots
> of concurrent queries touching different indices, will we get degraded
> performance due to highly concurrent disk access?
> Sorry if these are trivial questions whose answers are available in the docs.
No, file handles will remain open. This is why I suggested you use a single
shard per user (which will further limit the file handles). The number of
threads accessing the index can be limited by choosing a different thread
pool implementation (the scaling one in the docs). How many servers do you
plan to use to run all those users on?
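(For illustration, creating such a single-shard, single-replica per-user
index might look like the sketch below; the settings spelling shown is the
later-era one, and early 0.x releases used camelCase keys like
numberOfShards, so adjust for your version.

    # One shard + one replica per user index (index name is hypothetical):
    curl -XPUT 'http://localhost:9200/nuser101' -d '{
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
    }')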
>> I have not tried just creating N indices and taking a memory measurement
>> before and after to get a feeling for how much memory N indices take.
>> What is the main reason that you want to have an index per user?
> The main reason is the architecture of the application we would like to use ES in.
> We have users clustered into groups. Generally a user sticks to one
> group of servers, but if the data of the users in one group grows too big,
> we have to move some of them to a new cluster. And we would like to be
> able to move ES data easily. I guess that with an index-per-user approach,
> all we have to do is copy the user's index files to another ES instance.
You can easily use a single index (or use the group as the index name), and
for each query you execute, make sure you wrap it in a filter that matches
on the user. Moving data between indices can be a simple search+scroll
from one index, indexing into the other. The same applies across different
clusters.
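(What "wrap it in a filter" can look like, sketched with the filtered-query
DSL; the DSL spelling follows later 0.x/1.x releases, and the index and
field names are hypothetical. The user-supplied part goes in the inner query.

    # Any user query, constrained to one user's documents:
    curl -XGET 'http://localhost:9200/group7/_search' -d '{
      "query": {
        "filtered": {
          "query": { "query_string": { "query": "subject:meeting" } },
          "filter": { "term": { "user_id": "42" } }
        }
      }
    }')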
>>> If you have a lot of indices (10k), then searching across all of them
>>> does not make sense (it will take forever).
>> Every query we send to ES uses only one index. We are not planning
>> to execute queries that need to touch more than one, as we index users'
>> private data to which other users have no access.
> Sounds good. Just note what I explained in the previous email regarding the
> cluster state and in-memory storage of the metadata for each index. I would
> first test that 10k indices can be handled by the amount of memory you plan
> to assign to each node.
We are not sure yet, as we are still researching the solution. We would not
like to throw too much hardware at it - especially in the early stage
when every user has little data. Later, as indices grow, the number of
users per machine will be smaller.
I guess that 50k-100k user index replicas per machine at the beginning is
quite a good shot.
>>> You can certainly search each time
>>> across N indices (where N varies depending on the query and the
>>> number of shards each index has, but not in the 100s).
>>> In terms of elasticsearch itself (I discussed this on a different thread as
>>> well), the limit is basically the amount of memory on the node. The cluster
>>> state, which is basically the in-memory representation of the fact that
>>> an index exists, the list of shards each index has, and which nodes they are
>>> assigned to (and a bit more), basically the metadata, is kept in memory.
>> And what about open index files? Does ES have some mechanism to limit
>> their number (e.g. some LRU mechanism which closes index files that
>> are not expected to be accessed in the near future)?
>> And is the number of threads accessing index files limited? Or, with lots
>> of concurrent queries touching different indices, will we get degraded
>> performance due to highly concurrent disk access?
>> Sorry if these are trivial questions whose answers are available in the docs.
> No, file handles will remain open. This is why I suggested you use a single
> shard per user (which will further limit the file handles). The number of
> threads accessing the index can be limited by choosing a different thread
> pool implementation (the scaling one in the docs). How many servers do you
> plan to use to run all those users on?
We will definitely use one shard per user index if we go with the
index-per-user approach.
>>> I have not tried just creating N indices and taking a memory measurement
>>> before and after to get a feeling for how much memory N indices take.
>>> What is the main reason that you want to have an index per user?
>> The main reason is the architecture of the application we would like to use ES in.
>> We have users clustered into groups. Generally a user sticks to one
>> group of servers, but if the data of the users in one group grows too big,
>> we have to move some of them to a new cluster. And we would like to be
>> able to move ES data easily. I guess that with an index-per-user approach,
>> all we have to do is copy the user's index files to another ES instance.
> You can easily use a single index (or use the group as the index name), and
> for each query you execute, make sure you wrap it in a filter that matches
> on the user. Moving data between indices can be a simple search+scroll
> from one index, indexing into the other. The same applies across different
> clusters.
Nice to know. We are using Solr now, and AFAIK moving data between
indices is not possible there. What we did was re-index a
user's data from the source when moving a user from cluster to cluster or
when we changed the Solr schema.
One more question: I guess search+scroll is probably still slower
than copying index files from machine to machine using rsync. Am I
right?
One more reason we are not very willing to use a single index: isolating
different users assures us that other users' documents will not
influence the performance of queries concerning the one we are interested in.
Sometimes the result set before filtering by user id can be huge (which
incurs a lot of CPU and memory usage). I guess it is possible to phrase
queries in a way that minimizes this problem, but as our users have
access to a rich query language, it might not be easy in every case.
I think, if you are going with a lower number of boxes at the beginning,
then having all the users in a single index, or groups of users in the same
index, makes more sense. You can make sure each user won't see other users'
documents by wrapping any query you execute against elasticsearch in a
filtered query that filters on the user.
The search+scroll solution does reindex the mentioned user data, but it
should be very, very fast, especially if you make the index operations
parallel and don't wait for them before the next scroll. The same cluster can
handle such reindexing, since elasticsearch is multi-index / multi-type by
design.
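(A hedged sketch of that search+scroll move between clusters; the endpoints
shown follow later releases - the 2010-era scroll and bulk APIs differed or
did not exist yet - so read this as the shape of the approach, with
hypothetical hosts and index names.

    # Open a scroll over the source user's index:
    curl -s -XGET 'http://old-cluster:9200/nuser101/_search?scroll=5m' -d '{
      "size": 500,
      "query": { "match_all": {} }
    }'
    # Repeatedly fetch the next page with the returned scroll id...
    curl -s -XGET 'http://old-cluster:9200/_search/scroll?scroll=5m&scroll_id=<id>'
    # ...and index each page of hits into the target cluster (in parallel,
    # without waiting for indexing before pulling the next page):
    curl -s -XPOST 'http://new-cluster:9200/nuser101/_bulk' --data-binary @page.json)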
cheers,
shay.banon
I understood from your previous posts that the problem with having a lot of
indices on one node is the amount of memory on the node. What do you
think (as the author of the code): would it be hard and a long-term effort
to implement in ElasticSearch some kind of module which would cope with
this problem (e.g. an LRU of needed resources, as Łukasz mentioned)?
The memory problem can certainly be solved if it poses a problem (and I
still need to measure it to see where the limits are). But each shard is a
Lucene index, with its own open file handles and so on. You can certainly
decrease that (especially if you have small indices per user), and can raise
the max file descriptors per machine, which should be good enough. The
question is where the limit is: can the machine handle 10k shards even
if memory is not a problem? In theory yes; in practice, I have never tested
it.
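(A small sketch of raising the file descriptor limit on Linux before
starting a node; the numbers and user name are illustrative.

    # Per-shell limit, then start the node:
    ulimit -n 65535
    $ES_HOME/bin/elasticsearch

    # For a persistent limit, add lines like these to /etc/security/limits.conf:
    #   esuser  soft  nofile  65535
    #   esuser  hard  nofile  65535)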