Problem with S3 gateway on EC2-based cluster

Hi All,

I'm brand new to ElasticSearch (it's great so far!) so forgive me if I'm
missing something obvious.

I'm trying to integrate ElasticSearch into a production environment. I have
a very volatile dataset that I need to fully reindex frequently, so I need
a lot more computing power for indexing than for day-to-day serving. So I
would like to have:

  1. A fairly large (~10 node) cluster of strong machines that I boot up and
    use only for indexing, and then shut down (to save money).
  2. A small (~1-3 node) cluster that I serve from.

The hope is that I'll be able to have the large cluster store the indices
to S3, shut it down, and then start the small cluster and have it pull the
indices down from the same S3 bucket.

I managed to get the large cluster started, the indices built, and the
data (apparently) saved to S3. But when I start the small cluster with the
same gateway settings, it gets stuck during startup, with the logs showing:

[2012-11-27 21:44:25,928][TRACE][index.gateway.s3 ] [Catiana] [labels][3] recovering_files [129] with total_size [341.9mb], reusing_files [0] with reused_size [0b]
[2012-11-27 21:45:53,851][WARN ][com.amazonaws.http.AmazonHttpClient] Unable to execute HTTP request: Timeout waiting for connection
[2012-11-27 21:45:54,073][WARN ][com.amazonaws.http.AmazonHttpClient] Unable to execute HTTP request: Timeout waiting for connection
[2012-11-27 21:45:54,074][WARN ][com.amazonaws.http.AmazonHttpClient] Unable to execute HTTP request: Timeout waiting for connection

Over and over, until eventually it starts showing Apache connection timeout
errors and just loops endlessly through those. I've spent a couple of hours
trying to diagnose this, to no avail. It seems like the issue is either with
S3 itself or somewhere fairly deep in the ES transport code. Setting all the
DEBUG loggers to TRACE doesn't produce any additional information. I googled
but couldn't find anything useful
(maybe https://forums.aws.amazon.com/message.jspa?messageID=296676 is
related?).

Since the data does seem to get uploaded to S3, I tried downloading it
manually to the ES data folder using s3cmd and setting the gateway to
local, but then the startup process just hangs endlessly after these lines:

[2012-11-27 21:55:39,904][INFO ][node ] [Banner, Robert Bruce] {0.19.11}[2365]: initializing ...
[2012-11-27 21:55:39,949][INFO ][plugins ] [Banner, Robert Bruce] loaded [cloud-aws], sites []
[2012-11-27 21:55:44,027][DEBUG][discovery.zen.ping.multicast] [Banner, Robert Bruce] using group [224.2.2.4], with port [54328], ttl [3], and address [null]
[2012-11-27 21:55:44,031][DEBUG][discovery.zen.ping.unicast] [Banner, Robert Bruce] using initial hosts [], with concurrent_connects [10]
[2012-11-27 21:55:44,033][DEBUG][discovery.ec2 ] [Banner, Robert Bruce] using ping.timeout [3s], master_election.filter_client [true], master_election.filter_data [false]
[2012-11-27 21:55:44,039][DEBUG][discovery.zen.elect ] [Banner, Robert Bruce] using minimum_master_nodes [-1]
[2012-11-27 21:55:44,040][DEBUG][discovery.zen.fd ] [Banner, Robert Bruce] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-27 21:55:44,069][DEBUG][discovery.zen.fd ] [Banner, Robert Bruce] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-27 21:55:45,290][DEBUG][discovery.ec2 ] [Banner, Robert Bruce] using host_type [PRIVATE_IP], tags [{}], groups [[production#elasticsearch]] with any_group [true], availability_zones [[]]
[2012-11-27 21:55:46,962][DEBUG][gateway.local ] [Banner, Robert Bruce] using initial_shards [quorum], list_timeout [30s]

Again, setting everything to TRACE reveals no additional info.

Here's my full ES.yml:

gateway:
  # if I try to use S3:
  type: s3
  s3:
    bucket: [redacted]
  # if I try to use local:
  type: local

gateway.local.auto_import_dangled: yes

index:
  store:
    type: memory

discovery:
  type: ec2

discovery.ec2.groups: [redacted]

cloud:
  aws:
    access_key: [redacted]
    secret_key: [redacted]

cluster:
  name: production-elasticsearch

path.data: /mnt/elasticsearch/

Any ideas on how I can fix these issues or achieve my goal (big intake
cluster, small production cluster) in another way? If not, I'm probably
going to have to give up on ES :(

Thanks!

-George

--

Some quick updates:

I realized I can also turn the Root Logger up to "Trace", which provided some new info:

On the S3 gateway:

[2012-11-27 23:44:28,022][WARN ][com.amazonaws.http.AmazonHttpClient] Unable to execute HTTP request: Timeout waiting for connection
[2012-11-27 23:44:28,022][DEBUG][org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager] Closing connections idle longer than 30 SECONDS
[2012-11-27 23:44:28,022][DEBUG][org.apache.http.impl.conn.tsccm.ConnPoolByRoute] Closing connections idle longer than 30 SECONDS
[2012-11-27 23:44:28,023][DEBUG][org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager] Get connection: HttpRoute[{}->http://mybucket-index.s3.amazonaws.com], timeout = 50000
[2012-11-27 23:44:28,023][DEBUG][org.apache.http.impl.conn.tsccm.ConnPoolByRoute] [HttpRoute[{}->http://mybucket-index.s3.amazonaws.com]] total kept alive: 0, total issued: 50, total allocated: 50 out of 50
[2012-11-27 23:44:28,023][DEBUG][org.apache.http.impl.conn.tsccm.ConnPoolByRoute] No free connections [HttpRoute[{}->http://mybucket-index.s3.amazonaws.com]][null]
[2012-11-27 23:44:28,023][DEBUG][org.apache.http.impl.conn.tsccm.ConnPoolByRoute] Available capacity: 0 out of 50 [HttpRoute[{}->http://mybucket-index.s3.amazonaws.com]][null]
[2012-11-27 23:44:28,023][DEBUG][org.apache.http.impl.conn.tsccm.ConnPoolByRoute] Need to wait for connection [HttpRoute[{}->http://mybucket-index.s3.amazonaws.com]][null]
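
For anyone reading along, the lines above say the HTTP pool has handed out all 50 of its connections ("total allocated: 50 out of 50", "total kept alive: 0"), so every new request waits and then times out. Below is a minimal Python sketch of that symptom, assuming a fixed-size pool whose connections are leased but not returned. `BoundedPool` and `simulate` are hypothetical names for illustration, not the AWS SDK or Apache HttpClient API, and whether the S3 gateway actually holds its connections this way is not established here. The 50 and 129 are the pool size and recovering_files count from the logs.

```python
import queue


class BoundedPool:
    """Toy fixed-size connection pool (illustration only)."""

    def __init__(self, max_connections):
        self._free = queue.Queue(maxsize=max_connections)
        for i in range(max_connections):
            self._free.put(f"conn-{i}")

    def lease(self, timeout):
        # Wait up to `timeout` seconds for a free connection.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("Timeout waiting for connection") from None

    def release(self, conn):
        # Hand the connection back so another request can use it.
        self._free.put(conn)


def simulate(pool_size, requests, release_after_use, timeout=0.01):
    """Issue `requests` sequential leases and return how many timed out."""
    pool = BoundedPool(pool_size)
    timed_out = 0
    for _ in range(requests):
        try:
            conn = pool.lease(timeout)
        except TimeoutError:
            timed_out += 1
            continue
        if release_after_use:
            pool.release(conn)
    return timed_out


if __name__ == "__main__":
    # 129 file requests against a pool of 50, as in the logs above.
    print("released after use:", simulate(50, 129, release_after_use=True))
    print("never released:    ", simulate(50, 129, release_after_use=False))
```

If connections are returned, a pool of 50 serves any number of sequential requests. If they are held, every request past the 50th waits out the timeout, which is the pattern the WARN lines show.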

I also tried using HDFS as a gateway since I do have a cluster running, but that produced the error:

  1. Error injecting constructor, org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.elasticsearch.gateway.hdfs.HdfsGateway.<init>(Unknown Source)
    while locating org.elasticsearch.gateway.hdfs.HdfsGateway
    while locating org.elasticsearch.gateway.Gateway
    Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1030)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
    at $Proxy15.getProtocolVersion(Unknown Source)

I'm using the Cloudera CDH4 hadoop distribution, and downgrading is not really an option.
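
The exception text itself carries both protocol numbers: the server side (the CDH4 cluster) speaks IPC version 7, while the Hadoop client library used by the gateway plugin speaks version 4. As a small sketch, assuming the message keeps exactly the wording shown in the trace above, the two numbers can be pulled out of a log with a regular expression. `parse_ipc_mismatch` is a hypothetical helper name, not part of Hadoop or ElasticSearch.

```python
import re

# Matches the wording of the RemoteException message quoted above.
IPC_MISMATCH = re.compile(
    r"Server IPC version (\d+) cannot communicate with client version (\d+)"
)


def parse_ipc_mismatch(message):
    """Return (server_version, client_version), or None if the text does not match."""
    match = IPC_MISMATCH.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    line = (
        "org.apache.hadoop.ipc.RemoteException: "
        "Server IPC version 7 cannot communicate with client version 4"
    )
    versions = parse_ipc_mismatch(line)
    if versions is not None:
        server, client = versions
        print(f"server IPC version {server}, client IPC version {client}")
```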

And it seems that for the local gateway, the line has to be:

gateway.local.auto_import_dangled: "yes"

(with quotes), otherwise the startup process chokes. But then with the quotes, the startup process just ignores my index anyway:

[2012-11-27 23:49:48,238][DEBUG][gateway.local.state.meta ] [Glob Herman] [labels] failed to find metadata for existing index location

The last option seems to be using GSH and sshfs to try to mount a common volume to all of my nodes and use gateway:fs

But that feels like quite a kludge…


George London
E: george.j.london@gmail.com
T: @rogueleaderr
B: rogueleaderr.tumblr.com

--

As I replied on IRC, the error message "Server IPC version 7 cannot communicate
with client version 4" is caused by elasticsearch-hadoop using an older Hadoop
client than the one Cloudera CDH4 runs. An interim build (elasticsearch-hadoop
compiled against Hadoop 2.0.2 alpha) is available at

https://github.com/jprante/elasticsearch-hadoop

Installation:

bin/plugin -install jprante/elasticsearch-hadoop/1.3.0-SNAPSHOT

Cheers,

Jörg
