ES snapshots

Hello there!
I have a cluster with ES 1.3.8, Kibana 3, and Logstash 1.5.0 on board.
I also have remote storage for snapshots; it is mounted via sshfs on the whole cluster.
But when I start a snapshot, my cluster stops working: logs are no longer collected.
Only after the snapshot finishes and I restart the services does everything go back to normal.
So while a snapshot is being taken (~2 hours) I have no logs for that time interval, and that is sad :frowning:
Any advice is welcome. Thanks!
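
For reference, the repository on that sshfs mount is a shared-filesystem ("fs") repository, registered roughly like this (the mount path below is just a placeholder, not the real one):

curl -XPUT 'localhost:9200/_snapshot/4tbtest' -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/es-snapshots"
  }
}'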

Are any of the APIs responding: _cat/nodes, hot_threads, etc.?
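
For example (assuming the default host and port):

curl 'localhost:9200/_cat/nodes?v'
curl 'localhost:9200/_nodes/hot_threads'
curl 'localhost:9200/_cluster/health?pretty'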

When I take the next snapshot (soon), I will check. Thanks!

I started a snapshot and everything broke.
But _cat says everything is OK:
curl localhost:9200/_cat/nodes?h=n,ip,port,v,d,hp,m
data2 192.168.98.8 9300 1.3.2 398.6gb 71 -
data1 192.168.98.7 9300 1.3.2 385gb 42 -
logstash 127.0.0.1 9300 1.3.8 17.7gb 4 *
logstash-localhost.localdomain-2447-7946 127.0.0.1 9301 1.5.1 -1b 33 -

What state does your cluster go into?
Are you monitoring load on the nodes?

I monitor my cluster with the kopf plugin.
When the snapshot begins, nothing changes: everything is green, and there is no load on my nodes.

What if you look at the snapshot status? Snapshotting doesn't necessarily place a heavy load on the nodes, and it definitely shouldn't affect the cluster's status (if that's what you were referring to with "green").

curl -XGET "localhost:9200/_snapshot/4tbtest/15.03.20-22/_status"

{"snapshots":[{"snapshot":"15.03.20-22","repository":"4tbtest","state":"STARTED","shards_stats":{"initializing":4,"started":8,"finalizing":0,"done":0,"failed":0,"total":12},"stats":{"number_of_files":2326,"processed_files":185,"total_size_in_bytes":77395626630,"processed_size_in_bytes":394038158,"start_time_in_millis":1432734868342,"time_in_millis":0},"indices":{"logstash-2015.03.20":{"shards_stats":{"initializing":1,"started":3,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":881,"processed_files":76,"total_size_in_bytes":30777452788,"processed_size_in_bytes":92625050,"start_time_in_millis":1432734868088,"time_in_millis":257},"shards":{"0":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"1":{"stage":"STARTED","stats":{"number_of_files":211,"processed_files":76,"total_size_in_bytes":10237498758,"processed_size_in_bytes":92625050,"start_time_in_millis":1432734868088,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"2":{"stage":"STARTED","stats":{"number_of_files":329,"processed_files":0,"total_size_in_bytes":10268019904,"processed_size_in_bytes":0,"start_time_in_millis":1432734868345,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"3":{"stage":"STARTED","stats":{"number_of_files":341,"processed_files":0,"total_size_in_bytes":10271934126,"processed_size_in_bytes":0,"start_time_in_millis":1432734868091,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"}}},"logstash-2015.03.22":{"shards_stats":{"initializing":1,"started":3,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":775,"processed_files":109,"total_size_in_bytes":26597219323,"processed_size_in_bytes":301413108,"start_time_in_millis":1432734868342,"time_in_millis":0},"shards":{"0":{"stage":"STARTED","stats":{"number_of_files":328,"processed_files":0,"total_size_in_bytes":8877532129,"processed_size_in_bytes":0,"start_time_in_millis":1432734868094,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"1":{"stage":"STARTED","stats":{"number_of_files":230,"processed_files":0,"total_size_in_bytes":8861607395,"processed_size_in_bytes":0,"start_time_in_millis":1432734868347,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"2":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"3":{"stage":"STARTED","stats":{"number_of_files":217,"processed_files":109,"total_size_in_bytes":8858079799,"processed_size_in_bytes":301413108,"start_time_in_millis":1432734868342,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"}}},"logstash-2015.03.21":{"shards_stats":{"initializing":2,"started":2,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":670,"processed_files":0,"total_size_in_bytes":20020954519,"processed_size_in_bytes":0,"start_time_in_millis":1432734868098,"time_in_millis":246},"shards":{"0":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"1":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"2":{"stage":"STARTED","stats":{"number_of_files":332,"processed_files":0,"total_size_in_bytes":10000314469,"proces
sed_size_in_bytes":0,"start_time_in_millis":1432734868098,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"3":{"stage":"STARTED","stats":{"number_of_files":338,"processed_files":0,"total_size_in_bytes":10020640050,"processed_size_in_bytes":0,"start_time_in_millis":1432734868344,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"}}}}}]}

and now my cluster looks like: http://prntscr.com/79zoa1

That snapshot status looks perfectly normal.

And you're saying that the drop in messages shown in your screenshot correlates with the start of the snapshot activity...?

Yep, you're right. I can't associate it with anything else.
Maybe it's something in my config?

cluster.name: elasticsearch2
node.name: "logstash"

script.disable_dynamic: true

# This forces the JVM to allocate all of ES_MIN_MEM immediately.
bootstrap.mlockall: true

node.master: true
node.data: false

index.number_of_shards: 4
index.number_of_replicas: 0
index.refresh_interval: 30s

# Search pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100

# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

# Cache Sizes
indices.fielddata.cache.size: 15%
indices.fielddata.cache.expire: 6h
indices.cache.filter.size: 15%
indices.cache.filter.expire: 6h

# Indexing Settings for Writes
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

http.cors.allow-origin: "*"
http.cors.enabled: true

I also want to describe the whole process.
I'm snapshotting my old indices, which are closed. So I open the 3-4 oldest indices and then start snapshotting them.
When the snapshot finishes, I delete these indices and then restart Elasticsearch and Logstash.
I also noticed that if I don't delete these indices, the services may not start properly. I mean, they start, but messages are not collected. :frowning:
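
Roughly, the commands look like this (the repository, snapshot, and index names are the same as above; wait_for_completion is optional and just makes the call block until the snapshot finishes):

curl -XPOST 'localhost:9200/logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22/_open'

curl -XPUT 'localhost:9200/_snapshot/4tbtest/15.03.20-22?wait_for_completion=true' -d '{
  "indices": "logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22"
}'

curl -XDELETE 'localhost:9200/logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22'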