ElasticSearch 1.7.3 vs 2.0 vs 2.1 missing data

Hi.
I have an ES cluster running 1.7.3 that stores logs parsed by logstash.
I want to upgrade to ES 2.x, so I ran the migration plugin to check what I needed to change.
I prepared a new logstash template compatible with ES 2.x and set up a new separate cluster running 2.0.1 and another separate one running 2.1. I'm using logstash 2.1.0.

Logs are sent to 3 clusters with this part of the config:

output {
  elasticsearch {
    hosts => "cluster17.es.service.consul"
    template => "/etc/logstash/template_api.json"
    index => "logstash-%{[@context][_index]}-%{+YYYY.MM.dd}"
    template_overwrite => true
    flush_size => 2000
    retry_max_interval => 15
    max_retries => 6
  }
  elasticsearch {
    hosts => "cluster20.es.service.consul"
    template => "/etc/logstash/template_api2.json"
    index => "logstash-%{[@context][_index]}-%{+YYYY.MM.dd}"
    template_overwrite => true
    flush_size => 2000
    retry_max_interval => 15
    max_retries => 6
  }
  elasticsearch {
    hosts => "cluster21.es.service.consul"
    template => "/etc/logstash/template_api2.json"
    index => "logstash-%{[@context][_index]}-%{+YYYY.MM.dd}"
    template_overwrite => true
    flush_size => 2000
    retry_max_interval => 15
    max_retries => 6
  }
}

And I ran into a weird problem.
In the 1.7 cluster, the index for one day had:
logstash-api-2015.12.06 items: 11,473,555 size: 5.3GB
In the 2.0.1 cluster:
logstash-api-2015.12.06 items: 9,609,880 size: 4.7GB
In the 2.1 cluster:
logstash-api-2015.12.06 items: 9,608,696 size: 4.6GB

The difference between 1.7 and 2.x is huge, and every full daily index on 2.x had 15-18% less data.
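For this particular day the gap works out to about 16%:

# Quick check of the gap for this day's index (doc counts from above).
docs_17 = 11473555
docs_20 = 9609880
docs_21 = 9608696
print(1 - docs_20 / docs_17)  # ~0.162, about 16% fewer docs on 2.0.1
print(1 - docs_21 / docs_17)  # ~0.163, about 16% fewer docs on 2.1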

I tested ES 2.x on different hardware hosts/VMs to rule out hardware problems. There were also no errors in the logs.
I wrote a script to compare the indexes from 1.7 and 2.x and check which types of messages are missing. But every missing message can be POSTed directly with curl to each cluster and is saved without problems.
How can I debug this issue?
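The count-comparison part of that script boils down to something like this (a Python 3 sketch; port 9200 and the single index name are just examples, the hostnames are the ones from the output config above):

# Sketch: compare the doc count of one daily index across the three clusters.
# Port 9200 is an assumption; the hostnames match the output config above.
import json
import urllib.request

CLUSTERS = {
    "1.7.3": "http://cluster17.es.service.consul:9200",
    "2.0.1": "http://cluster20.es.service.consul:9200",
    "2.1":   "http://cluster21.es.service.consul:9200",
}
INDEX = "logstash-api-2015.12.06"

for version, base in CLUSTERS.items():
    # GET /<index>/_count returns {"count": N, ...}
    with urllib.request.urlopen("%s/%s/_count" % (base, INDEX)) as resp:
        count = json.loads(resp.read().decode("utf-8"))["count"]
    print("%-6s %s" % (version, count))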

It sounds like a shard did not come back. Assuming you have 5 shards per index, that could represent almost 20%.

Did you look at the pending tasks?
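For reference, both can be checked over the REST API; a minimal sketch (Python 3, assuming the cluster answers on localhost:9200):

# Minimal check of cluster health and pending tasks.
# localhost:9200 is an assumption; point it at your cluster.
import urllib.request

BASE = "http://localhost:9200"
for endpoint in ("/_cluster/health", "/_cluster/pending_tasks"):
    with urllib.request.urlopen(BASE + endpoint) as resp:
        print(endpoint, resp.read().decode("utf-8"))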

@dadoonet all shards are allocated:

es 2.0.1:
{"cluster_name":"logstash20","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":21,"active_shards":21,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

es 2.1:
{"cluster_name":"logstash21","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":20,"active_shards":20,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

and pending_tasks on both clusters shows:
{"tasks":[]}

Ha! I totally missed the logstash part.

So you are sending your data to 3 clusters at the same time, but you end up with a different number of logs in each.
2.0 and 2.1 have roughly the same number of docs, but 1.7 has noticeably more.

When you said "there were no errors in the logs", did you mean the logstash logs or the elasticsearch logs?

@dadoonet: I'm sending data from one logstash instance to 3 different ES clusters. And there are no errors in the elasticsearch or logstash logs. Everything looks normal.
And when I manually POST messages that are missing from the 2.x clusters but exist in 1.7, I get a confirmation with the newly inserted document _id.
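That manual check is essentially this (sketched in Python 3 here instead of curl; the document body and the type name "logs" are placeholders, not a real missing event):

# Sketch of the manual re-POST check; the document body and the type name
# "logs" are placeholders, not the real missing event.
import json
import urllib.request

doc = {
    "@timestamp": "2015-12-06T12:00:00Z",
    "@source_host": "example-host",
    "@message": "example syslog line",
}
req = urllib.request.Request(
    "http://cluster21.es.service.consul:9200/logstash-api-2015.12.06/logs",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # elasticsearch answers with the generated _id once the doc is indexed
    print(json.loads(resp.read().decode("utf-8"))["_id"])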

I have no explanation.
Maybe you could open this thread on the logstash forum so the experts there can explain or trace things?

Sure, I'll post this on the logstash forum as well. Thanks for trying :slight_smile:

@warkolm Here is the rest of the logstash config I used:

input {
  redis {
    host => "10.8.34.27"
    port => 6379
    data_type => "list"
    key => "logstash"
  }
  redis {
    host => "10.8.34.34"
    port => 6379
    data_type => "list"
    key => "logstash"
  }
  redis {
    host => "10.8.34.36"
    port => 6379
    data_type => "list"
    key => "logstash"
  }
  redis {
    host => "10.8.38.42"
    port => 6379
    data_type => "list"
    key => "logstash"
  }
}
filter {
  # drop messages bigger than 16kb
  range {
    ranges => [ "message", 16384, 99999999, "drop" ]
  }
  if [type] == "syslog" {
    grok {
      patterns_dir => "/opt/logstash/patterns"
      match => [
        # apache/nginx access logs
        "message", "%{APACHE_ACCESS_COMBINED_VHOST}",
        "message", "%{NGINX_ACCESS_LOGS}",
        # json-formatted messages
        "message", "%{JSON_MESSAGES}",
        # logs from border load balancers
        "message", "%{BORDER_LB_LOG}",
        # bind logs
        "message", "%{NAMED_LOG}",
        # logs from fastly
        "message", "%{EDGE_CACHE_LOG}",
        "message", "%{FASTLY_DEBUG_LOG}",
        "message", "%{FASTLY_RESTARTS_LOG}",
        "message", "%{FASTLY_HOTLINKS}",
        "message", "%{S_MAX_DEBUG}",
        "message", "%{CB_DEBUG_LOG}",
        # pt-kill
        "message", "%{PT_KILL}",
        # powerconnect
        "message", "%{POWERCONNECT_LOG}",
        "message", "%{VARNISHNCSA}",
        # normal syslog messages
        "message", "%{SYSLOG_STANDARD}"
      ]
    }
    if "_grokparsefailure" in [tags] {
      grok {
        match => {
          "message" => "%{GREEDYDATA:syslog_message}"
        }
      }
    } else {
      if [jsonMessage] != [null] {
        json {
          source => "jsonMessage"
        }
        mutate {
          remove_field => ["jsonMessage"]
          add_tag => ["json"]
        }
      } else if [httpversion] != [null] {
        mutate {
          add_tag => ["apache_access_log"]
        }
        if [request] =~ /^\/api\/v1/ {
          mutate {
            add_field => ["[@context][_index]","api"]
          }
        }
      } else {
        mutate {
          rename => ["syslog_message", "@message"]
          add_tag => ["message"]
        }
      }
      syslog_pri {}
      mutate {
        rename => [ "syslog_hostname", "@source_host" ]
        rename => [ "syslog_severity", "severity" ]
        rename => [ "syslog_facility", "facility" ]
        rename => [ "syslog_pri", "priority" ]
        rename => [ "syslog_program", "program" ]
        remove_field => [
          "@version",
          "host",
          "message",
          "syslog_facility_code",
          "syslog_severity_code",
          "syslog_timestamp",
          "type"
        ]
      }
    }
  }
  # access log from fastly syslog
  if "edge-cache-requestmessage" in [tags] {
    mutate {
      add_field => ["[@context][_index]","api"]
    }
  }

  # database queries killed
  if [program] == "pt-kill" {
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    mutate {
      remove_field => [ "timestamp" ]
    }
  }
  mutate {
    lowercase => [ "@context", "_index" ]
  }
}
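Since the filter above adds tags like json, apache_access_log and message (plus _grokparsefailure on parse errors), the same _count comparison can be broken down per tag to see which message types go missing; a rough sketch (Python 3, port 9200 assumed):

# Sketch: per-tag doc counts for one daily index on the 1.7.3 and 2.1 clusters,
# to spot which message types are missing. Port 9200 is an assumption.
import json
import urllib.request

INDEX = "logstash-api-2015.12.06"
CLUSTERS = {
    "1.7.3": "http://cluster17.es.service.consul:9200",
    "2.1":   "http://cluster21.es.service.consul:9200",
}
TAGS = ["json", "apache_access_log", "message", "_grokparsefailure"]

for tag in TAGS:
    body = json.dumps({"query": {"match": {"tags": tag}}}).encode("utf-8")
    counts = {}
    for version, base in CLUSTERS.items():
        # the count API accepts a query in the request body
        req = urllib.request.Request("%s/%s/_count" % (base, INDEX), data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            counts[version] = json.loads(resp.read().decode("utf-8"))["count"]
    print(tag, counts)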

It turned out to be logstash's fault. After upgrading logstash to 2.1.1, every cluster had the same amount of data in each index.

Thanks for the update.