Can't join cluster

Hello,

I have an ES cluster of two machines and I'm having problems adding a third one.
I've tried both unicast and multicast discovery, with the same results (a sketch
of the two setups is at the end of this message). The first two machines discover
each other, but when I start the third one, I can see that it creates TCP
connections to port 9300 on the first two machines, but nothing more happens.
The last thing on the console is this:

[2012-10-16 11:39:07,151][INFO ][node ] [logger4.telstra] {0.19.10}[6829]: initializing ...
[2012-10-16 11:39:07,157][INFO ][plugins ] [logger4.telstra] loaded [], sites [bigdesk, head, paramedic]
[2012-10-16 11:39:08,123][INFO ][node ] [logger4.telstra] {0.19.10}[6829]: initialized
[2012-10-16 11:39:08,123][INFO ][node ] [logger4.telstra] {0.19.10}[6829]: starting ...
[2012-10-16 11:39:07,196][INFO ][transport ] [logger4.telstra] bound_address {inet[/10.100.0.40:9300]}, publish_address {inet[/10.100.0.40:9300]}

Here is the netstat output:

[11:54]root@logger4:/home/ndenev# netstat -anptcp
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
tcp4       0      0  10.100.0.40.9300   10.100.0.20.23613  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.12586  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.34997  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.60125  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.36215  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.47311  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.12279  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.32206  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.20.32259  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.21844  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.19881  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.58187  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.28351  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.11270  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.60018  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.61148  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.24928  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   10.100.0.10.31942  ESTABLISHED
tcp4       0      0  10.100.0.40.9300   *.*                LISTEN

Nothing shows in the log files of the first two machines.

Any ideas on how to debug this, or what may be causing it?
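
For reference, the discovery settings in question look roughly like this (a sketch, not my exact config; the setting names are the standard 0.19.x keys, and the unicast host list below just uses the first two machines' addresses):

  # elasticsearch.yml -- multicast discovery (the 0.19.x default)
  discovery.zen.ping.multicast.enabled: true

  # elasticsearch.yml -- unicast alternative
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["10.100.0.10:9300", "10.100.0.20:9300"]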

--

Hello Nikolay,

You can try turning on more verbose logging in /etc/elasticsearch/logging.yml.
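
For example, something like this (a sketch based on the stock 0.19.x logging.yml layout; adjust the logger names and levels to match your file):

  rootLogger: INFO, console, file
  logger:
    # raise discovery and transport logging to see ping/join details
    discovery: DEBUG
    transport: DEBUG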

From your netstat output I can see that the other machines connect to
port 9300 on logger4. But can you connect to 10.100.0.20 from logger4?

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Tue, Oct 16, 2012 at 12:55 PM, Nikolay Denev <ndenev@gmail.com> wrote:

[quoted original message trimmed]

--

Yep, I can connect to the other hosts from logger4:

[14:41]root@logger4:/home/ndenev# telnet 10.100.0.20 9300
Trying 10.100.0.20...
Connected to 10.100.0.20.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[14:41]root@logger4:/home/ndenev# telnet 10.100.0.10 9300
Trying 10.100.0.10...
Connected to 10.100.0.10.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

With the logging level raised to DEBUG, I got this on logger2 (10.100.0.20):

[2012-10-16 15:09:21,101][DEBUG][transport.netty ] [logger2 ] connected to node [[logger4 ][jh7DLxuaQ72l_gFK_uKhUQ][inet[/10.100.0.40:9300]]]

And this on logger4:

[2012-10-16 15:09:18,995][INFO ][node ] [logger4 ] {0.19.10}[7828]: initializing ...
[2012-10-16 15:09:19,001][INFO ][plugins ] [logger4 ] loaded [], sites [bigdesk, head, paramedic]
[2012-10-16 15:09:19,009][DEBUG][common.compress.lzf ] using [UnsafeChunkDecoder] decoder
[2012-10-16 15:09:19,018][DEBUG][common.compress ] failed to load xerial snappy-java
java.lang.ExceptionInInitializerError
    at org.elasticsearch.common.compress.snappy.xerial.XerialSnappy.<clinit>(XerialSnappy.java:41)
    at org.elasticsearch.common.compress.CompressorFactory.<clinit>(CompressorFactory.java:58)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:123)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:67)
    at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:200)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.lang.NullPointerException
    at java.lang.Throwable.printStackTrace(Throwable.java:477)
    at java.lang.Throwable.printStackTrace(Throwable.java:468)
    at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
    ... 7 more
[2012-10-16 15:09:19,030][DEBUG][env ] [logger4 ] using node location [[/tank/es-data/data/logger/nodes/0]], local_node_id [0]
[2012-10-16 15:09:19,625][DEBUG][threadpool ] [logger4 ] creating thread_pool [generic], type [cached], keep_alive [30s]
[2012-10-16 15:09:19,629][DEBUG][threadpool ] [logger4 ] creating thread_pool [index], type [cached], keep_alive [5m]
[2012-10-16 15:09:19,630][DEBUG][threadpool ] [logger4 ] creating thread_pool [bulk], type [cached], keep_alive [5m]
[2012-10-16 15:09:19,630][DEBUG][threadpool ] [logger4 ] creating thread_pool [get], type [cached], keep_alive [5m]
[2012-10-16 15:09:19,630][DEBUG][threadpool ] [logger4 ] creating thread_pool [search], type [cached], keep_alive [5m]
[2012-10-16 15:09:19,630][DEBUG][threadpool ] [logger4 ] creating thread_pool [percolate], type [cached], keep_alive [5m]
[2012-10-16 15:09:19,630][DEBUG][threadpool ] [logger4 ] creating thread_pool [management], type [scaling], min [1], size [5], keep_alive [5m]
[2012-10-16 15:09:19,632][DEBUG][threadpool ] [logger4 ] creating thread_pool [flush], type [scaling], min [1], size [10], keep_alive [5m]
[2012-10-16 15:09:19,632][DEBUG][threadpool ] [logger4 ] creating thread_pool [merge], type [scaling], min [1], size [20], keep_alive [5m]
[2012-10-16 15:09:19,633][DEBUG][threadpool ] [logger4 ] creating thread_pool [refresh], type [scaling], min [1], size [10], keep_alive [5m]
[2012-10-16 15:09:19,633][DEBUG][threadpool ] [logger4 ] creating thread_pool [cache], type [scaling], min [1], size [4], keep_alive [5m]
[2012-10-16 15:09:19,633][DEBUG][threadpool ] [logger4 ] creating thread_pool [snapshot], type [scaling], min [1], size [5], keep_alive [5m]
[2012-10-16 15:09:19,646][DEBUG][transport.netty ] [logger4 ] using worker_count[64], port[9300-9400], bind_host[null], publish_host[null], compress[false], connect_timeout[30s], connections_per_node[2/6/1], receive_predictor[512kb->512kb]
[2012-10-16 15:09:19,652][DEBUG][discovery.zen.ping.multicast] [logger4 ] using group [224.2.2.4], with port [54328], ttl [3], and address [null]
[2012-10-16 15:09:19,654][DEBUG][discovery.zen.ping.unicast] [logger4 ] using initial hosts [], with concurrent_connects [10]
[2012-10-16 15:09:19,655][DEBUG][discovery.zen ] [logger4 ] using ping.timeout [3s], master_election.filter_client [true], master_election.filter_data [false]
[2012-10-16 15:09:19,658][DEBUG][discovery.zen.elect ] [logger4 ] using minimum_master_nodes [-1]
[2012-10-16 15:09:19,659][DEBUG][discovery.zen.fd ] [logger4 ] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-10-16 15:09:19,661][DEBUG][discovery.zen.fd ] [logger4 ] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-10-16 15:09:19,674][DEBUG][monitor.jvm ] [logger4 ] enabled [true], last_gc_enabled [false], interval [1s], gc_threshold [{default=GcThreshold{name='default', warnThreshold=10000, infoThreshold=5000, debugThreshold=2000}, ParNew=GcThreshold{name='ParNew', warnThreshold=1000, infoThreshold=700, debugThreshold=400}, ConcurrentMarkSweep=GcThreshold{name='ConcurrentMarkSweep', warnThreshold=10000, infoThreshold=5000, debugThreshold=2000}}]
[2012-10-16 15:09:19,681][DEBUG][monitor.os ] [logger4 ] Using probe [org.elasticsearch.monitor.os.SigarOsProbe@111edceb] with refresh_interval [1s]
[2012-10-16 15:09:19,683][DEBUG][monitor.process ] [logger4 ] Using probe [org.elasticsearch.monitor.process.SigarProcessProbe@2e39d325] with refresh_interval [1s]
[2012-10-16 15:09:19,685][DEBUG][monitor.jvm ] [logger4 ] Using refresh_interval [1s]
[2012-10-16 15:09:19,685][DEBUG][monitor.network ] [logger4 ] Using probe [org.elasticsearch.monitor.network.SigarNetworkProbe@4e9e75f6] with refresh_interval [5s]
[2012-10-16 15:09:19,689][DEBUG][monitor.network ] [logger4 ] net_info
host [logger4]
bridge0 display_name [bridge0]
        address [/10.100.0.40]
        mtu [9000] multicast [true] ptp [false] loopback [false] up [true] virtual [false]
carp0 display_name [carp0]
        address [/10.22.10.13]
        mtu [1500] multicast [false] ptp [false] loopback [true] up [true] virtual [false]
lo0 display_name [lo0]
        address [/127.0.0.1]
        mtu [16384] multicast [true] ptp [false] loopback [true] up [true] virtual [false]
igb0 display_name [igb0]
        address [/10.22.10.15]
        mtu [1500] multicast [true] ptp [false] loopback [false] up [true] virtual [false]

[2012-10-16 15:09:19,692][DEBUG][monitor.fs ] [logger4 ] Using probe [org.elasticsearch.monitor.fs.SigarFsProbe@22343bb6] with refresh_interval [1s]
[2012-10-16 15:09:19,849][DEBUG][indices.store ] [logger4 ] using indices.store.throttle.type [none], with index.store.throttle.max_bytes_per_sec [0b]
[2012-10-16 15:09:19,853][DEBUG][cache.memory ] [logger4 ] using bytebuffer cache with small_buffer_size [1kb], large_buffer_size [1mb], small_cache_size [10mb], large_cache_size [500mb], direct [true]
[2012-10-16 15:09:19,875][DEBUG][cluster.routing.allocation.decider] [logger4 ] using node_concurrent_recoveries [2], node_initial_primaries_recoveries [4]
[2012-10-16 15:09:19,876][DEBUG][cluster.routing.allocation.decider] [logger4 ] using [cluster.routing.allocation.allow_rebalance] with [indices_all_active]
[2012-10-16 15:09:19,876][DEBUG][cluster.routing.allocation.decider] [logger4 ] using [cluster_concurrent_rebalance] with [2]
[2012-10-16 15:09:19,878][DEBUG][gateway.local ] [logger4 ] using initial_shards [quorum], list_timeout [30s]
[2012-10-16 15:09:19,928][DEBUG][indices.recovery ] [logger4 ] using max_size_per_sec[0b], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
[2012-10-16 15:09:19,950][DEBUG][http.netty ] [logger4 ] using max_chunk_size[8kb], max_header_size[8kb], max_initial_line_length[4kb], max_content_length[100mb], receive_predictor[512kb->512kb]
[2012-10-16 15:09:19,954][DEBUG][indices.memory ] [logger4 ] using index_buffer_size [5.9gb], with min_shard_index_buffer_size [4mb], max_shard_index_buffer_size [512mb], shard_inactive_time [30m]
[2012-10-16 15:09:19,956][DEBUG][indices.cache.filter ] [logger4 ] using [node] weighted filter cache with size [20%], actual_size [11.9gb], expire [null], clean_interval [1m]
[2012-10-16 15:09:18,996][DEBUG][gateway.local.state.meta ] [logger4 ] using gateway.local.auto_import_dangled [YES], with gateway.local.dangling_timeout [2h]
[2012-10-16 15:09:18,996][DEBUG][gateway.local.state.meta ] [logger4 ] took 0s to load state
[2012-10-16 15:09:18,996][DEBUG][gateway.local.state.shards] [logger4 ] took 0s to load started shards state
[2012-10-16 15:09:18,998][DEBUG][bulk.udp ] [logger4 ] using enabled [false], host [null], port [9700-9800], bulk_actions [1000], bulk_size [5mb], flush_interval [5s], concurrent_requests [4]
[2012-10-16 15:09:18,999][INFO ][node ] [logger4 ] {0.19.10}[7828]: initialized
[2012-10-16 15:09:18,999][INFO ][node ] [logger4 ] {0.19.10}[7828]: starting ...
[2012-10-16 15:09:19,011][DEBUG][netty.channel.socket.nio.NioProviderMetadata] Couldn't determine the NIO constraint level from the system properties; using the safest level (2)
[2012-10-16 15:09:19,016][DEBUG][netty.channel.socket.nio.SelectorUtil] Using select timeout of 500
[2012-10-16 15:09:19,016][DEBUG][netty.channel.socket.nio.SelectorUtil] Epoll-bug workaround enabled = false
[2012-10-16 15:09:19,072][DEBUG][transport.netty ] [logger4 ] Bound to address [/10.100.0.40:9300]
[2012-10-16 15:09:19,073][INFO ][transport ] [logger4 ] bound_address {inet[/10.100.0.40:9300]}, publish_address {inet[/10.100.0.40:9300]}

The only thing that's different is that logger2 is running with:

openjdk version "1.6.0_30"
OpenJDK Runtime Environment (build 1.6.0_30-b24)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

and logger4:

openjdk version "1.6.0_32"
OpenJDK Runtime Environment (build 1.6.0_32-b26)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

The "xerial snappy-java" exception occurs on the other two hosts, without causing problems.

On Oct 16, 2012, at 3:26 PM, Radu Gheorghe <radu.gheorghe@sematext.com> wrote:

[quoted message trimmed]

--

Hello Nikolay,

On Tue, Oct 16, 2012 at 4:16 PM, Nikolay Denev <ndenev@gmail.com> wrote:

[quoted debug output trimmed; it ends with:]

[2012-10-16 15:09:19,073][INFO ][transport ] [logger4 ] bound_address {inet[/10.100.0.40:9300]}, publish_address {inet[/10.100.0.40:9300]}

And what do you get after that?

For me, when I have a node that doesn't discover any master, I get something like:

[2012-10-16 16:38:31,658][INFO ][transport ] [Powerhouse] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.199.101.94:9300]}
[2012-10-16 16:38:38,688][DEBUG][discovery.zen ] [Powerhouse] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2012-10-16 16:38:38,694][DEBUG][cluster.service ] [Powerhouse] processing [zen-disco-join (elected_as_master)]: execute

And for the node that does discover a master, I get:

[2012-10-16 16:39:30,520][INFO ][transport ] [Marrow] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.199.101.94:9301]}
[2012-10-16 16:39:37,550][DEBUG][discovery.zen ] [Marrow] filtered ping responses: (filter_client[true], filter_data[false]) --> target [[Mister Jip][Z-hmAO-qST2XuSk7TovAtg][inet[/10.199.101.94:9300]]], master [[Mister Jip][Z-hmAO-qST2XuSk7TovAtg][inet[/10.199.101.94:9300]]]
[2012-10-16 16:39:37,563][DEBUG][transport.netty ] [Marrow] connected to node [[Mister Jip][Z-hmAO-qST2XuSk7TovAtg][inet[/10.199.101.94:9300]]]

You can see there's a 7s interval between binding to the transport address
and deciding on the master. That interval is governed by the
discovery.zen.ping.timeout value in the ES config, which defaults to 3s.
Have you tried increasing it? Although 3s should be plenty...
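
For example, in elasticsearch.yml (the 10s value here is just an illustration):

  # give discovery pings more time to come back (default is 3s)
  discovery.zen.ping.timeout: 10s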

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

On Tuesday, October 16, 2012 at 4:46:38 PM UTC+3, Radu Gheorghe wrote:

[quoted debug output trimmed; it ends with the transport bound_address line]

And what do you get after that?

No more messages; it just hangs there, and the TCP sessions stay ESTABLISHED.

[quoted explanation about discovery.zen.ping.timeout trimmed]

Have you tried increasing the default value? Although 3s should be plenty...

I've tried with a 30s timeout, with no difference.

--