Elasticsearch TransportClient with multiple addresses fails

I'm getting closer to opening an issue against ES 0.90.3. But first, I'd
like to see what others think. Here's my scenario:

I have a test driver that can define N update threads and N query threads.
The update threads automatically generate a sequence of unique documents
and update them, then add each document's ID to a bigqueue instance, from
which the query threads read the IDs and issue queries. Each document
has a 10-second time to live. The driver tracks all stats (elapsed time,
total updates, total queries, total elapsed time for all updates and
queries, errors, and so on).

When I run this on my MacBook, with 8 update threads and 8 query threads, I
see an update rate of over 60/second. When I run it on a Linux laptop (same
quad core i7 CPUs, very similar laptop-class disk drive), I see an update
rate of 268/second. Cool. No errors in either case.

When I pointed the driver (running on the MacBook) to a remote 3-node
cluster, added all 3 node addresses to the TransportClient in the driver,
and ran it, I got update errors galore. For example, I get lots of these
on the client (test driver) side:

org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream

And on one of the servers, I see errors such as the following in the ES
log for the cluster:

Failed to execute [index {[rtctest][connection][3136393030303031363940766572697A6F6E2E636F6D], source[{"onet":"xxxx","orig":"1690000169@xxxx.com","term":"1700000170@xxxx.com"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_ttl]
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:396)
    at org.elasticsearch.index.mapper.internal.TTLFieldMapper.postParse(TTLFieldMapper.java:167)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:525)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:329)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:521)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

I then re-ran the test against the remote cluster, but this time I added
only the address of one of the nodes to the TransportClient in the driver,
and the test ran fine with no update failures at all. It's as if issuing
updates with more than one host address added to the TransportClient causes
failures.
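
For reference, the driver builds its client with the stock multi-address
pattern. Here is a minimal sketch of it (the cluster name, host names, and
port are placeholders, not my actual values):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Placeholders: substitute the real cluster name and node addresses.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "mycluster")
        .build();
TransportClient client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("nodeA", 9300))
        .addTransportAddress(new InetSocketTransportAddress("nodeB", 9300))
        .addTransportAddress(new InetSocketTransportAddress("nodeC", 9300));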

There is only one document type, and the fields are not indexed (for
maximum update performance; all queries are get-by-id). The default TTL is
given as 10s in the mappings, and it is honored and processed, which is
kind of cool, since this makes the test case self-restarting! There is one
replica defined.
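
In case it helps, the mapping is roughly the following (a sketch from
memory, using the field names shown in the example documents above; the
exact mapping may differ slightly):

// Put the mapping on the existing index via the Java admin client.
String mapping =
      "{ \"connection\": {"
    + "    \"_ttl\": { \"enabled\": true, \"default\": \"10s\" },"
    + "    \"properties\": {"
    + "      \"onet\": { \"type\": \"string\", \"index\": \"no\" },"
    + "      \"orig\": { \"type\": \"string\", \"index\": \"no\" },"
    + "      \"term\": { \"type\": \"string\", \"index\": \"no\" }"
    + "    } } }";
client.admin().indices().preparePutMapping("rtctest")
      .setType("connection")
      .setSource(mapping)
      .execute().actionGet();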

And the cluster state stays green throughout all tests, even for the
failures.

Brian


Do you have the exact same version of Java running across all the nodes
and your client?

There are a few places in the ES codebase where native Java serialization
is used...and unfortunately minor updates sometimes change the
serialization and break things. For example, there is a known change with
exception serialization between 1.7.0_17 and 1.7.0_21.

Also make sure you are running the same version of ES on both client and
server.

Based on the exceptions you are seeing, my first guess is the Java version
problem.
-Zach


Zach,

Thanks for the quick feedback. Your response gave me a thought to run my
test driver on one of the cluster's nodes and point the TransportClient to
the other two nodes. All 3 nodes are running the exact same Java version,
so this helps eliminate any Java issues.

When I do this, here are the final statistics from my driver (mildly
informative to you, I know):

DONE: generated-connections=436 elapsed=10s [db-update: total=435
success=389 fail=46 time=9.5s] [db-query: total=435 found=389 not_found=46
time=885.8ms] [queue: current=0 max=1]

But here is one of the responses, complete with the Elasticsearch exception
message that can now be parsed:

FAILURE when writing connnection:
{"index":{"_index":"rtest","_type":"connection","_id":"35353330303030353533407576657273652E6E6574"}}
-> {"onet":"xxx","orig":"5530000553@xxx.net","term":"5540000554@xxx.com"}
:: class com.cequint.nameid.database.DatabaseException Elasticsearch index request:
{"index":{"_index":"rtest","_type":"connection","_id":"35353330303030353533407576657273652E6E6574"}}
:: {"onet":"xxx","orig":"5530000553@xxx.net","term":"5540000554@xxx.com"} :
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_ttl]

Oddly enough, the same thing happens when I run the driver from that node
and point it at just one other node, not only when I point it at both of
the other nodes.

But when I point the TransportClient to "localhost" then I get no failures
at all. That seems odd, as the cluster is green with 2 copies (in other
words 1 replica) of the single shard that is created and mapped. And the
TTL value of 10s causes all of the documents to vanish after the TTL
interval passes, and the cluster is still green, so that part seems to be
working.

By the way, I have the exact same version of Java running on all 3 nodes,
but a different version on the two laptops (MacBook and Ubuntu). That's
because the laptops have their own preferred Java 1.6, but our cluster is
deployed using another 1.6 version specified by our IT folks. On all 3
nodes of the cluster:

java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) Client VM (build 16.0-b13, mixed mode, sharing)

But on the MacBook:

java version "1.6.0_51"
Java(TM) SE Runtime Environment (build 1.6.0_51-b11-457-10M4509)
Java HotSpot(TM) 64-Bit Server VM (build 20.51-b01-457, mixed mode)

So, yes, Java is still "write once, debug everywhere". But at least when I
run the test driver on one of the cluster nodes, the same Java version
means it can now parse the Elasticsearch failure message.

Brian


1.6.0_18 is a three-year-old JVM with many unfixed bugs and should be
upgraded ASAP.

You should exactly match the JVM versions (here: 16.0-b13 and
20.51-b01-457), or your setup is doomed.

There is no reason to use Java 6 any longer; it is no longer supported by
Oracle. You have better chances for maintainability and future support by
using the latest Java 7 JVMs, on both Linux and Mac.

For myself, I currently use

java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

I am waiting for 7u40+ until
https://issues.apache.org/jira/browse/LUCENE-5212 is resolved.

Jörg


Echoing what Jörg said. There are many, many bugs in older JVMs that were
only found because Lucene triggered them. What this means is that there are
known issues with the JVM that directly affect Lucene (and therefore
Elasticsearch). These aren't just performance bugs - they are "crash your
application because the JVM is buggy" bugs. =)

Definitely upgrade to 1.7 ASAP. In case it wasn't clear in my first post,
you need to ensure that the Java version on the client machines (your
laptops) is identical to the Java version on the server machines. The
reason you were/are getting serialization errors is that the versions are
different.

Regarding your parser error... I'm not really sure; can you gist the
serialized JSON of the request and the full error response from the logs?
It's probably just a syntax error in your mapping or document, or you are
trying to index a document that has already expired. Something to note: the
TTL purge only runs once per 60s by default, so if you query a document
after 10s it may still exist when you expect it to be deleted. You can
change that interval with the indices.ttl.interval setting.
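
For reference, here is a hedged sketch of changing that from the Java API.
I believe the setting can be applied as a transient cluster setting, but I
haven't verified that on 0.90.x specifically; if it isn't dynamic there,
put it in elasticsearch.yml (or a -Des.indices.ttl.interval=... startup
option) on each node instead:

// Assumption: indices.ttl.interval is dynamically updatable on this version.
client.admin().cluster().prepareUpdateSettings()
      .setTransientSettings(ImmutableSettings.settingsBuilder()
              .put("indices.ttl.interval", "5s")
              .build())
      .execute().actionGet();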

-Zach


OK. Updating the Java version is on the list of things to do but that's
something that I have to work with the bureaucracy to solve.

The TTL processing is well understood: a 10s TTL with a 1m interval means
that, in the worst case, a get-by-id will return a negative _ttl for up to
50s until the document is deleted. My test driver currently classifies a
response document with a negative _ttl as a Not Found condition (to mirror
the intended production use), and that all works well. I'm guessing that
it's best to keep the default TTL interval to prevent ES from thrashing; I
certainly don't need (nor can I expect) realtime TTL cleanup. And the ES
documentation says that versioning is used so that a subsequent index
against an expired-but-not-yet-deleted document won't cause that document
to be deleted during that cycle (not the best words, but that's the idea).
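
For what it's worth, the query side of my driver does roughly this (a
simplified sketch; the index/type names are the ones from my earlier
output, "id" is the document ID read from the queue, error handling is
stripped, and fetching _ttl as a get field is just one way to do it):

import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.index.get.GetField;

// Get by ID, asking for the remaining _ttl; a missing document or a
// negative remaining _ttl is classified as Not Found.
GetResponse response = client.prepareGet("rtest", "connection", id)
        .setFields("_ttl", "_source")
        .execute().actionGet();
GetField ttl = response.getField("_ttl");
boolean notFound = !response.isExists()
        || ttl == null
        || ((Number) ttl.getValue()).longValue() < 0;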

But I've been playing around with the test driver, and the only
Java-version-related issue I can see is that the exception message cannot
be parsed. But the failure occurs regardless. And here's where it gets
interesting: it's fully repeatable and seems to be an ES issue and not a
Lucene+Java issue:

The cluster has 3 nodes; let's call them A, B, and C. The hosts are set up
with static routing, and that seems to be working (according to
Elasticsearch Head). The index is configured with 1 shard and 1 replica;
one primary shard currently exists on A and one replica shard on C. Host B
has no shards (since only one replica is defined, ES seems to have reserved
host B for future failover use).

Then I ran the client on the Mac. Yeah, it's a different, older, buggier
Java version, but let's see what happens anyway:

  1. If I create a TransportClient and add only host A, or only host B, or
    both hosts A and B, then I see no failures at all. None. The mix of updates
    and queries works flawlessly, the TTL processing works flawlessly, and the
    driver runs to completion with all index ops OK, all queries OK, and the
    index empty after the driver ends and the ttl interval fires. Lucene is as
    happy as it can be.

  2. If I create a TransportClient and add only host B, then 100% of the
    updates fail. On the MacBook, I see the error parsing the exception. But
    when the client is run on the cluster with the same Java version, I see the
    _ttl parse failure. Not sure what it means, but it only happens to updates
    that are directed to host B that hosts no shards of that index.

  3. If I create a TransportClient and add all 3 hosts, then yes, the
    TransportClient round-robins the updates. And I see that 1 out of 3 of the
    updates fail and 2 out of 3 succeed.

It is repeatable that adding a host to the TransportClient that doesn't
currently contain a shard results in a 100% failure to update when the
update is directed to that host, even though that host is part of a cluster
that has 2 copies of that shard and whose cluster status is Green.

There is a maximum of about 200-300 documents at any point in time due to
TTL processing; hence the definition of only 1 shard. And there's only 1
replica in the 3-node cluster, to ensure a green status even if one of the
nodes crashes. Or is defining fewer replicas than nodes my problem?

Brian


Oops, since it's nodes A and C that currently contain the shards, I
actually meant:

  1. If I create a TransportClient and add only host A, or only host C, or
    both hosts A and C, then I see no failures at all. None. The mix of
    updates and queries works flawlessly, the TTL processing works flawlessly,
    and the driver runs to completion with all index ops OK, all queries OK,
    and the index empty after the driver ends and the ttl interval fires.
    Lucene is as happy as it can be.

But my question still stands: If one of the nodes in a cluster does not
contain a shard due to the number of replicas plus 1 being less than the
number of nodes, is it my bug or ES's bug that an update directed to the
shard-less node fails?

Thanks!

Brian


Hey, sorry for the delay. Crazy week so far.

I saw above that you said the hosts are "set up with static routing"...
what do you mean by that? Are you using your own custom routing? Or did
you mean the default route-by-id that ES uses?

Under normal circumstances, you can direct an update request to any node
(whether it has the data or not) and it will know where to forward it. If
you are using custom routing, though, you'll have to supply the routing
parameter yourself, as ES no longer knows where to send the data.
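
For example, with custom routing the indexing call would look roughly like
this (a sketch; the index/type names are from your posts, and the routing
key and source variables are hypothetical):

// Only needed with custom routing: the same routing key must be supplied
// on every index and get request for a given document.
client.prepareIndex("rtest", "connection", id)
      .setRouting(customRoutingKey)   // hypothetical routing value
      .setSource(jsonSource)
      .execute().actionGet();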

-Zach


Hi, Zach.

No problem. I've been poking at it from several different points of view,
and the only conclusion that I can come up with is that you and Jörg are
right about the Java versions. The plan is to use the version that Jörg
recommended and see how that goes.

When I said "static routing" I really meant unicast discovery. (I plead
overwork and stress for my incorrect terminology!). For example, HOSTS is
the space-delimited set of host names and is the same on all machines in
the cluster:

OPTS="-Des.discovery.zen.ping.multicast.enabled=false"
OPTS="$OPTS -Des.discovery.zen.ping.unicast.hosts=$HOSTS"
OPTS="$OPTS -Des.discovery.zen.minimum_master_nodes=$MIN_MASTERS"

The funny thing is, the Java versions and ES versions and my own jar file
builds are 100% the same on all three nodes. When I run the test driver on
host A, I can round-robin requests to A or C, but they all fail when
directed to B. When I run my test driver on B, I can round-robin requests
to A and C, but they fail 100% if directed to B itself (either B
explicitly, or localhost). Yet when I run the test driver on host A and the
TransportClient round-robins between A and C, replication ensures that when
I run another query-only client on B I can see the data. So the data is
being replicated to B.

So if this is a Java bug (we'll have to wait until that cluster is
updated), it appears that Lucene is perfectly and blissfully happy and the
cluster status never wavers from green, but somewhere in the Elasticsearch
wrapper itself there is a potentially fatal Java bug that affects ES but
not Lucene.

I'll be sure to report back with the results once we migrate to the Java
version that Jörg recommended and then re-run everything:

java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

Brian


Ah, gotcha. Yeah, let us know what happens after you upgrade Java. If the
issue is still there after upgrading... we should try to build a curl/JSON
recreation and see if this is a TransportClient issue or a general ES issue.

Totally feel the stress/work thing...been totally crazy here too =)

-Zach
