ES snapshots


(IT2) #1

Hello there!
I have a cluster with ES 1.3.8, Kibana 3, and Logstash 1.5.0 on board.
I also have a remote space for snapshots. It's mounted via sshfs on the whole cluster.
But when I start a snapshot, my cluster stops working: logs are not collected.
Only after the snapshot finishes, and after I restart the services, does everything go back to normal.
So while a snapshot is running (~2 hours) I have no logs for that time interval, and this is sad :frowning:
Please, any advice. Thanks!


(Mark Walkom) #2

Are any of the APIs responding: _cat/nodes, hot_threads, etc.?
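For reference, a couple of quick checks along those lines (assuming the default HTTP port 9200; in 1.x the hot threads endpoint lives under _nodes):

curl "localhost:9200/_cat/nodes?v"
curl "localhost:9200/_nodes/hot_threads"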


(IT2) #3

When I make the next snapshot (soon), I will check. Thanks!


(IT2) #4

I started a snapshot and everything broke.
But _cat says everything is OK:
curl localhost:9200/_cat/nodes?h=n,ip,port,v,d,hp,m
data2 192.168.98.8 9300 1.3.2 398.6gb 71 -
data1 192.168.98.7 9300 1.3.2 385gb 42 -
logstash 127.0.0.1 9300 1.3.8 17.7gb 4 *
logstash-localhost.localdomain-2447-7946 127.0.0.1 9301 1.5.1 -1b 33 -


(Mark Walkom) #5

What state does your cluster go into?
Are you monitoring load on the nodes?
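For example, per-node load and heap usage can be checked with something like this (column names per the 1.x _cat/nodes docs; adjust as needed):

curl "localhost:9200/_cat/nodes?v&h=name,load,heap.percent"
curl "localhost:9200/_nodes/stats/os,process?pretty"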


(IT2) #6

I monitor my cluster with the kopf plugin.
When the snapshot begins, nothing changes. Everything is green, and there is no load on my nodes.


(Magnus Bäck) #7

When the snapshot begins, nothing changes. Everything is green, and there is no load on my nodes.

What if you look at the snapshot status? Snapshotting doesn't necessarily place a heavy load on the nodes, and it definitely shouldn't affect the cluster's status (if that's what you were referring to with "green").


(IT2) #8

curl -XGET "localhost:9200/_snapshot/4tbtest/15.03.20-22/_status"

{"snapshots":[{"snapshot":"15.03.20-22","repository":"4tbtest","state":"STARTED","shards_stats":{"initializing":4,"started":8,"finalizing":0,"done":0,"failed":0,"total":12},"stats":{"number_of_files":2326,"processed_files":185,"total_size_in_bytes":77395626630,"processed_size_in_bytes":394038158,"start_time_in_millis":1432734868342,"time_in_millis":0},"indices":{"logstash-2015.03.20":{"shards_stats":{"initializing":1,"started":3,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":881,"processed_files":76,"total_size_in_bytes":30777452788,"processed_size_in_bytes":92625050,"start_time_in_millis":1432734868088,"time_in_millis":257},"shards":{"0":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"1":{"stage":"STARTED","stats":{"number_of_files":211,"processed_files":76,"total_size_in_bytes":10237498758,"processed_size_in_bytes":92625050,"start_time_in_millis":1432734868088,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"2":{"stage":"STARTED","stats":{"number_of_files":329,"processed_files":0,"total_size_in_bytes":10268019904,"processed_size_in_bytes":0,"start_time_in_millis":1432734868345,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"3":{"stage":"STARTED","stats":{"number_of_files":341,"processed_files":0,"total_size_in_bytes":10271934126,"processed_size_in_bytes":0,"start_time_in_millis":1432734868091,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"}}},"logstash-2015.03.22":{"shards_stats":{"initializing":1,"started":3,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":775,"processed_files":109,"total_size_in_bytes":26597219323,"processed_size_in_bytes":301413108,"start_time_in_millis":1432734868342,"time_in_millis":0},"shards":{"0":{"stage":"STARTED","stats":{"number_of_files":328,"processed_files":0,"total_size_in_bytes":8877532129,"processed_size_in_bytes":0,"start_time_in_millis":1432734868094,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"1":{"stage":"STARTED","stats":{"number_of_files":230,"processed_files":0,"total_size_in_bytes":8861607395,"processed_size_in_bytes":0,"start_time_in_millis":1432734868347,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"2":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"3":{"stage":"STARTED","stats":{"number_of_files":217,"processed_files":109,"total_size_in_bytes":8858079799,"processed_size_in_bytes":301413108,"start_time_in_millis":1432734868342,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"}}},"logstash-2015.03.21":{"shards_stats":{"initializing":2,"started":2,"finalizing":0,"done":0,"failed":0,"total":4},"stats":{"number_of_files":670,"processed_files":0,"total_size_in_bytes":20020954519,"processed_size_in_bytes":0,"start_time_in_millis":1432734868098,"time_in_millis":246},"shards":{"0":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"1":{"stage":"INIT","stats":{"number_of_files":0,"processed_files":0,"total_size_in_bytes":0,"processed_size_in_bytes":0,"start_time_in_millis":0,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"},"2":{"stage":"STARTED","stats":{"number_of_files":332,"processed_files":0,"total_size_in_bytes":10000314469,"proces
sed_size_in_bytes":0,"start_time_in_millis":1432734868098,"time_in_millis":0},"node":"XNs4qFu2SNK126PyfAOZdg"},"3":{"stage":"STARTED","stats":{"number_of_files":338,"processed_files":0,"total_size_in_bytes":10020640050,"processed_size_in_bytes":0,"start_time_in_millis":1432734868344,"time_in_millis":0},"node":"v7d4Ya0jTcWzFObkh9UDJA"}}}}}]}

and now my cluster looks like: http://prntscr.com/79zoa1


(Magnus Bäck) #9

curl -XGET "localhost:9200/_snapshot/4tbtest/15.03.20-22/_status"

That looks perfectly normal.

and now my cluster looks like: http://prntscr.com/79zoa1

And you're saying that the drop in messages correlates with the start of the snapshot activity...?


(IT2) #10

And you're saying that the drop in messages correlates with the start of the snapshot activity...?

Yep, you're right. I can't associate it with anything else.
Maybe it's something in my config?

cluster.name: elasticsearch2
node.name: "logstash"

script.disable_dynamic: true

#This forces the JVM to allocate all of ES_MIN_MEM immediately.
bootstrap.mlockall: true

node.master: true
node.data: false

index.number_of_shards: 4
index.number_of_replicas: 0
index.refresh_interval: 30s

# Search pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100

# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

# Cache Sizes
indices.fielddata.cache.size: 15%
indices.fielddata.cache.expire: 6h
indices.cache.filter.size: 15%
indices.cache.filter.expire: 6h

# Indexing Settings for Writes
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

http.cors.allow-origin: "*"
http.cors.enabled: true
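
For what it's worth, the settings actually applied on each running node can be double-checked with the standard node info API:

curl "localhost:9200/_nodes/settings?pretty"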

(IT2) #11

I also want to describe the whole process.
I'm snapshotting my old indices, which are closed. So I open the 3-4 oldest indices and then start snapshotting them.
When the snapshot finishes, I delete these indices and then restart Elasticsearch and Logstash.
I also noticed that if I don't delete these indices, the services may not start properly. I mean, they start, but messages are not collected. :frowning:
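
For reference, a rough sketch of that sequence with curl (the repository name 4tbtest is the one from earlier in the thread; the index and snapshot names below are just examples):

curl -XPOST "localhost:9200/logstash-2015.03.20/_open"
curl -XPUT "localhost:9200/_snapshot/4tbtest/15.03.20-22?wait_for_completion=true" -d '{"indices": "logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22"}'
curl -XDELETE "localhost:9200/logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22"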

