Inconsistent data among nodes for one (or more) indexes (ES 0.11)

Here's one of the problems I run into frequently.

I have a 5-node ES cluster (using local storage) running 8 indexes, each with
thousands of documents (one of them with millions), and every now and then
replication seems to break:

for node in n1 n2 n3 n4 n5; do
  curl "http://${node}:9200/index/_count?q=*"
done

{"count":22408,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":22408,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":20230,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":20230,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":20230,"_shards":{"total":1,"successful":1,"failed":0}}

The only solution I've found so far is to recreate the index and reindex.

Any help will be really appreciated! :smiley:

Have you seen the problem with recovery and the local gateway with more than
one index I posted? Also, do you see any exceptions in the log? The index seems
to have just one shard; how many replicas does it have?
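To check the shard and replica settings being asked about here, one option is to query the index settings over HTTP. This is a minimal sketch: the node name `n1` is taken from the thread, and the availability of the `_settings` endpoint on 0.11 is an assumption.

```shell
# Hedged sketch: fetch the index settings, which include
# index.number_of_shards and index.number_of_replicas.
# "n1" and the index name "index" come from the thread; the
# _settings endpoint on ES 0.11 is assumed here.
settings_url() {
  # $1 = node hostname
  echo "http://$1:9200/index/_settings"
}

# Against a live cluster:
#   curl -s "$(settings_url n1)"
settings_url n1
```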

-shay.banon

On Fri, Oct 8, 2010 at 9:06 PM, Pablo Borges pablort@gmail.com wrote:


Also, a few more points on the count:

  1. Going to different nodes is irrelevant. The count will round-robin
    between the shards and be directed to the appropriate node(s).

  2. Do you index while you execute the count? Have you changed the refresh
    interval?
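If you do want to interrogate each node's own shard copy rather than let the count round-robin, later ES versions expose a `preference` query parameter. A hedged sketch follows; note that `preference` was added after this era, so it is likely not available on 0.11 and is purely an assumption about newer versions.

```shell
# Hedged sketch: pin the count to the local shard copy on each node via
# "preference=_local". NOTE: the preference parameter is from later ES
# versions and is assumed NOT to exist on 0.11.
local_count_url() {
  # $1 = node hostname
  echo "http://$1:9200/index/_count?q=*&preference=_local"
}

for node in n1 n2 n3 n4 n5; do
  # curl -s "$(local_count_url "$node")"   # uncomment against a live cluster
  local_count_url "$node"
done
```

With `preference=_local`, a diverging count would identify which node holds the stale copy, instead of depending on which shard the round-robin happens to hit.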

-shay.banon

On Fri, Oct 8, 2010 at 9:17 PM, Shay Banon shay.banon@elasticsearch.com wrote:


About 1: If you notice, I use only one shard per node, which means at least
one of them is out of sync.
About 2: No. The data was indexed some time ago and updates are not
frequent.

On Fri, Oct 8, 2010 at 4:23 PM, Shay Banon shay.banon@elasticsearch.com wrote:


But do they still happen?

On Fri, Oct 8, 2010 at 10:04 PM, Pablo Borges pablort@gmail.com wrote:


No exceptions, but I needed to reduce my log level so as not to fill the
partition the logs are written to (there were several search exceptions, which
I'd like to know if there's a way to disable).

Here's my logging.yml

rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: INFO

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ABSOLUTE}][%-5p][%-25c] %m%n"

  file:
    type: dailyRollingFile
    file: /var/log/elasticsearch/${cluster.name}.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ABSOLUTE}][%-5p][%-25c] %m%n"

On Fri, Oct 8, 2010 at 4:17 PM, Shay Banon shay.banon@elasticsearch.com wrote:


Yes:

{"count":22408,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":20230,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":20230,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":22408,"_shards":{"total":1,"successful":1,"failed":0}}
{"count":22408,"_shards":{"total":1,"successful":1,"failed":0}}

On Fri, Oct 8, 2010 at 5:05 PM, Shay Banon shay.banon@elasticsearch.com wrote:


I suggest you keep action on DEBUG; I can help with the search exceptions as
well.
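A minimal sketch of the corresponding change to the logging.yml posted above (only the logger section is shown; the rest of the file stays as posted):

```yaml
logger:
  # keep action execution errors visible while debugging this issue
  action: DEBUG
```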

On Fri, Oct 8, 2010 at 10:06 PM, Pablo Borges pablort@gmail.com wrote:


If they still happen, you might be running into the near-real-time aspect of
elasticsearch. The question is whether the counts are still different once
there is no indexing going on. Also, note that there is a problem with
multiple indices and the local gateway.
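One way to rule out the near-real-time explanation is to force a refresh first, so any buffered segments become searchable, and then re-run the counts. A hedged sketch, using the node names from the thread:

```shell
# Hedged sketch: force a refresh (making buffered, near-real-time segments
# searchable), then count again on every node. Endpoints as understood for
# this era of ES; node/index names taken from the thread.
refresh_url() { echo "http://$1:9200/index/_refresh"; }
count_url()   { echo "http://$1:9200/index/_count?q=*"; }

# Against a live cluster:
#   curl -s -XPOST "$(refresh_url n1)"
#   for node in n1 n2 n3 n4 n5; do curl -s "$(count_url "$node")"; done
refresh_url n1
count_url n1
```

If the counts still disagree after a refresh with no indexing in flight, near-real-time lag is ruled out and a stale replica becomes the likely cause.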

-shay.banon

On Fri, Oct 8, 2010 at 10:07 PM, Pablo Borges pablort@gmail.com wrote:
