I am pretty new to Elasticsearch, so what follows might be obvious to most
people, but I am currently experiencing something I was not expecting from
my indexing cluster. In my process chain, I first preprocess some data,
which I then send to my Elasticsearch cluster for indexing.
When working with a 10GB sample, Elasticsearch is able to do all the
indexing in 20 minutes on a 3-machine cluster; when trying to index 30GB
(300+ million documents), however, the process takes hours, starting out
fairly fast and slowing down over time. I am indexing the data by batching
5000 operations into each bulk request before sending it to the cluster.
While I understand that scaling out would probably solve my problem, I had
assumed that a cluster of 3 good machines would be able to handle it - or
was I wrong?
Can anyone help me understand why this is happening, or perhaps point me
in the right direction?
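For reference, the bulk indexing step looks roughly like the sketch below
(written here with the Python elasticsearch client purely for illustration;
the index, type and field names are made up, and the docs are dummy data):

# Minimal sketch of the bulk indexing loop; index, type and field names are
# invented and the documents are dummy data.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["node1:9200", "node2:9200", "node3:9200"])  # placeholder hosts

docs = [{"value": float(i), "label": "sample-%d" % i} for i in range(10)]

actions = (
    {
        "_index": "readings",   # hypothetical index name
        "_type": "reading",     # types existed in 0.x; newer versions drop this
        "_source": doc,
    }
    for doc in docs
)

# chunk_size=5000 mirrors the 5000-operation bulks described above.
helpers.bulk(es, actions, chunk_size=5000)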
Are your servers set up with enough memory, and has enough memory been
allocated to the JVM heap? Depending on the structure of your indices,
you may be hitting disk.
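One quick way to check is to look at heap usage in the node stats; something
like this rough sketch with the Python client (host name is a placeholder):

# Rough sketch: check JVM heap usage per node via the node stats API.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])  # placeholder host

stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    used_mb = mem["heap_used_in_bytes"] / 1024.0 / 1024.0
    max_mb = mem["heap_max_in_bytes"] / 1024.0 / 1024.0
    print("%s heap: %.0f / %.0f MB" % (node.get("name", node_id), used_mb, max_mb))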
Each node in the cluster is equipped with 8GB of RAM, running on a 512MB
heap, and handling 6 indices.
I noticed that one of the indices becomes much bigger than I would expect:
it reaches a size several times larger than the indexed data. The index in
question stores a double-typed field which the others do not. I set the
precision_step attribute in the mapping for this field in an effort to
reduce the size; I am also compressing the _source field (which I need
present in the response), disabling _all, and storing and analyzing only
what is needed.
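To make that concrete, the mapping is along these lines (a sketch, not the
actual mapping; the index, type and field names are invented, and _source
compression and string/not_analyzed are options from the 0.x mapping API):

# Sketch of a mapping with _all disabled, _source compression enabled and a
# larger precision_step on the double field (fewer index terms per value, so
# a smaller index at the cost of slower range queries). Names are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])  # placeholder host

es.indices.create(index="readings", body={
    "mappings": {
        "reading": {
            "_all": {"enabled": False},
            "_source": {"compress": True},   # 0.x-era option
            "properties": {
                "value": {"type": "double", "precision_step": 8},
                "label": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
})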
Is there any good way to tune the indices and the machine configuration,
and to know exactly what is going on (I use bigdesk to get a handle on
things at the moment)? And, at the risk of being a bit vague, can such a
problem be solved by configuration, or will I still need to add more nodes
to the cluster in the end?
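For context, the numbers bigdesk shows can also be pulled straight from the
cluster and index stats APIs; a rough sketch with the Python client (host
name is a placeholder):

# Rough sketch: cluster health plus per-index store size and indexing counts,
# roughly the numbers bigdesk visualizes. Host name is a placeholder.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])

print(es.cluster.health())

stats = es.indices.stats(metric="store,indexing")
for name, index_stats in stats["indices"].items():
    total = index_stats["total"]
    size_mb = total["store"]["size_in_bytes"] / 1024.0 / 1024.0
    print("%s: %.0f MB on disk, %d index ops" % (
        name, size_mb, total["indexing"]["index_total"]))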
Considering the size of the index you mentioned, 512 MB of heap does
seem a bit low. In our installation of ES, we typically allocate
enough heap to keep the entire index in memory, although Shay has
mentioned this is not necessary. I'd suggest you increase the heap and
see what happens.
Have you tried optimizing your indices?
It often reduces index size and improves query speed.
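With the Python client that is roughly the following (index name is a
placeholder; recent versions call the same operation forcemerge):

# Sketch: merge ("optimize") an index down to fewer segments. Older clients
# expose this as indices.optimize; recent versions call it forcemerge.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])  # placeholder host
es.indices.optimize(index="readings", max_num_segments=2)  # placeholder index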
There are lots of knobs you can tweak to keep your cluster healthy.
Which knobs to tweak, and how big your cluster needs to be, is usually a
function of what you're optimizing for. If you're optimizing for indexing
speed, then you'll be looking at your index mapping, minimizing analysis,
and avoiding storing unnecessary data. If you're optimizing for query
speed, then you're dealing with index structure, query structure, and
a number of application-level knobs I don't know enough about to
comment on at length. If your data is very large but you don't get
many queries, you may want to have one node with a lot of memory. If
you get lots of queries on a small index, more nodes make sense.
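For the indexing-speed case, one of the most commonly tweaked knobs is the
refresh interval; a minimal sketch with the Python client (index name is a
placeholder):

# Sketch: turn off refresh during a heavy bulk load, then restore it.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])  # placeholder host

es.indices.put_settings(index="readings",
                        body={"index": {"refresh_interval": "-1"}})
# ... run the bulk indexing here ...
es.indices.put_settings(index="readings",
                        body={"index": {"refresh_interval": "1s"}})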
Happy to be corrected here if I'm talking out of my ass!
I have tried tweaking settings such as the number of segments and the
flushes, for example, and they did provide some improvement, but not to a
considerable extent. I will try allocating more heap just to see if it
actually makes any difference, but I cannot say I have seen many
out-of-memory errors related to the heap.
In-memory storage is a lot faster of course, and indexing performance
starts degrading at a later point, but as soon as swapping starts
occurring, speed drops dramatically again - which in some ways suggests
scaling out a bit more (which can become expensive in terms of RAM when the
indexes start getting much bigger than 30~40GB). I will try increasing the
memory allocation further and tweaking the swap space, but I don't see this
being a really good long-term solution.
I will post more info after tweaking the memory a bit further.
I have increased the heap size to 6GB per node and it works. However, I am
not entirely sure I am happy with this solution, considering that the
degradation point still seems pretty unpredictable to me at the moment.
One thing I must mention, however, is that after tweaking my client code to
create a huge number of small indexes, I was able to index the 40GB of data
on the cluster with the same settings. The downside to this is that I need
to be able to query the data across all indexes for specific types, which
would mean much bigger response times for queries. With my current number
of documents, after optimizing the number of segments per index, I got
response times for a basic query down from over 10 seconds to a few seconds.
With a large number of small indexes, it seems that the degradation never
happens (at least not within the limits of the data I was working with),
assuming that there is enough heap to support it (smaller merges and
flushes?). Is there a downside, besides query response times and
complexity, to having a huge number of smaller indexes as opposed to a
small number of huge indexes?
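For concreteness, the cross-index query I have in mind is something like the
sketch below (index pattern and field names are placeholders):

# Sketch of a query that spans many small indexes via an index pattern.
# Index pattern and field names are placeholders; pre-7.x clients can also
# pass doc_type to restrict the search to one type.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])

result = es.search(
    index="readings-*",   # matches all of the small indexes
    body={"query": {"range": {"value": {"gte": 1.0, "lte": 2.0}}}},
)
print(result["hits"]["total"])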
The degradation is most likely due to merges and IO performance. You will
probably want to tweak your merge settings and keep an eye on your IOwait
percentage. I've just finished tuning my cluster to handle 40-60k
inserts/sec for small records; happy to offer any tips if you have
questions after looking at the merge settings.
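The merge settings in question live under index.merge.policy; a rough sketch
of adjusting them with the Python client (values and index name are
illustrative, and the exact setting names vary between versions):

# Sketch: relax the tiered merge policy so merges are larger and less
# frequent. Values are illustrative; check the docs for your ES version,
# as the exact setting names have changed over time.
from elasticsearch import Elasticsearch

es = Elasticsearch(["node1:9200"])  # placeholder host

es.indices.put_settings(index="readings", body={
    "index": {
        "merge.policy.segments_per_tier": 20,
        "merge.policy.max_merge_at_once": 20,
    }
})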