Clearly I'm missing something

Fred_Manley · September 25, 2014, 1:14pm

Hey guys,

I have Elasticsearch installed on 2 Windows 2012 servers (2 cores each,
8GB of RAM each, 4GB ES_HEAP each) with mostly default settings, and the
following yml:

bootstrap.mlockall: true

# Search pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300

I was trying to optimize for bulk indexing.

I'm reading 500k rows from a database at about 3 min per read, and then
calling the bulk api with a batch size of 5k in a new thread so I can index
while making the next read. I'm running this process on one of the
elasticsearch nodes, which also gets marked as master.

I'm experiencing a whole slew of problems.

1. Swap space is 5-6GB on each node - I'm not sure what this means on
   windows, but I've disabled the page file and that didn't help.

2. Some data isn't getting indexed. I'll run it pointing to one index, and
   the count doesn't match up with running it against a new index. Retrying
   sometimes solves this.

3. The data randomly disappears. At various times it claims I have
   anywhere from 500k documents to 80M documents, even when both nodes are up.

4. I have it set to two shards, which defaulted to 2 shards on the same
   node, but randomly switches to one shard on the other node.

5. Nodes seem to get unbootstrapped fairly frequently, which results in
   loss of data as well.

6. ElasticHq claims, again at various stages, that elasticsearch has
   deleted the missing documents. Sometimes these documents mysteriously get
   undeleted, and show up again in search.

7. I've tried to stop these issues by refreshing the index, then stopping
   the bulk indexer app, then restarting elasticsearch on both nodes, which
   has also resulted in GB worth of data loss.

What am I doing wrong here? Elasticsearch is completely unusable from my
perspective. These issues aren't even acceptable in a development
environment, let alone a production environment.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ef0d783a-9be1-463b-9705-6919694ca3f8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fred_Manley · September 25, 2014, 1:31pm

Doing some more digging, it looks like what happened is it moved one shard
to another node, and since it hadn't balanced the shards before (all the
data was in one shard), it started indexing all new data to the other shard
on the other node. Looking at this node, this data has been deleted
completely. I'm guessing there's no way to recover it.

It's also not making good use of resources. One node uses close to
100% of CPU and 80% of memory at all times, while the other uses 20%/12%.

This seems like a common problem with Elasticsearch. It's commonly
marketed as horizontally scalable, but every time I've tried to run more
than one node, it stops working properly altogether.


jprante · September 25, 2014, 1:49pm

What you describe looks like a misconfigured system.

Do not use 2 nodes. Use at least 3 nodes for a reliable distributed system.
Think of split brain: a 2-node cluster is prone to it.
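
The split-brain point comes down to majority arithmetic. Below is a minimal sketch, assuming every node is master-eligible; the function names are illustrative only. In Elasticsearch 1.x the majority value is what would normally go into the `discovery.zen.minimum_master_nodes` setting.

```python
def quorum(master_eligible_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """Nodes that can be lost while a majority is still reachable."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


# Compare a 2-node cluster with a 3-node cluster.
for n in (2, 3):
    print(n, "nodes: quorum", quorum(n), "tolerates", tolerated_failures(n))
```

With 2 nodes the majority is 2, so a quorum-protected cluster cannot lose either node and keep electing a master; with 3 nodes it can lose one.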

Also, do not use 2 shards. Use one shard per node, i.e. 3 shards per index
on 3 nodes. For bulk indexing, you can temporarily set the replica count to 0.
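
As a rough illustration of the shard and replica advice, the sketch below only builds the JSON settings bodies; it does not talk to a cluster. The helper names are made up for this example. The assumption is that the first body is sent when the index is created (the shard count cannot be changed afterwards) and the second is sent to the index's `_settings` endpoint once the bulk load is done.

```python
import json


def create_index_body(node_count, replicas=0):
    """Settings for index creation: one primary shard per node."""
    return {
        "settings": {
            "index": {
                "number_of_shards": node_count,
                "number_of_replicas": replicas,
            }
        }
    }


def update_replicas_body(replicas):
    """Settings update to apply once the bulk load has finished."""
    return {"index": {"number_of_replicas": replicas}}


# Body for creating the index before the bulk load (3 nodes, no replicas).
print(json.dumps(create_index_body(3), indent=2))
# Body for the settings update after the load, restoring one replica.
print(json.dumps(update_replicas_body(1), indent=2))
```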

mlockall is not for Windows.

The bulk thread pool and queue sizes are too high for only 2 CPU cores. The
default thread pool values are OK. Without seeing how your code calls the
bulk API, it is impossible to pinpoint problems. But your points 2 and 3
seem like a natural consequence of pushing the thread pools beyond what
your nodes can handle. The quite small heap may also bring a node down from
time to time, because you did not adjust the 5s node fault timeout.
Adjusting the timeout may help, but only if you also limit the thread pool.
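
Since the indexing code was not posted, here is only a sketch of what a bounded bulk client could look like, not the original application. It assumes a hypothetical `send_bulk(batch)` callable that posts one batch and returns the parsed bulk response, with the top-level `errors` flag and one `items` entry per document. The point is to keep the number of concurrent requests small and to check every item's status, so rejected documents are retried instead of silently missing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def failed_items(batch, response):
    """Return the rows whose bulk item reports a status of 300 or higher."""
    if not response.get("errors"):
        return []
    failed = []
    for row, item in zip(batch, response["items"]):
        # Each item is keyed by its action name, e.g. {"index": {"status": 201}}.
        result = next(iter(item.values()))
        if result.get("status", 500) >= 300:
            failed.append(row)
    return failed


def index_all(rows, send_bulk, batch_size=5000, workers=2):
    """Send all batches with at most `workers` requests running at once.

    `send_bulk` is a hypothetical callable supplied by the caller.
    Returns the rows that should be retried.
    """
    def run(batch):
        return failed_items(batch, send_bulk(batch))

    retry = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues every batch up front, but only `workers` run concurrently.
        for failed in pool.map(run, batches(rows, batch_size)):
            retry.extend(failed)
    return retry
```

Anything returned by `index_all` can be fed back into it after a pause, which is a gentler way to handle rejections than enlarging the server-side queue.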

Jörg

