Determining why records are missing after Logstash indexing

Hi - I'm attempting to index a single file with 19,690 lines; each line will be its own record in ES. The lines are pretty small, only 4-5 fields per line.

After I run Logstash using the following command:

sudo bin/logstash -f config/hardball.conf --debug -l hardball.log

I see that my Elasticsearch index only contains 19,502 records.

I've tried looking in hardball.log but it's 200 MB.

Any recommendations on what to search for in the log file to see what might have happened to the missing 188 records? Any other debugging tips with Logstash?


What does your logstash config look like? Did you wait for Elasticsearch to refresh before counting the documents?
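
For the second question, a quick way to rule it out, assuming Elasticsearch is on localhost:9200 and the index is named players, is to force a refresh and then ask for an exact count:

# make all indexed documents visible to search
curl -XPOST "localhost:9200/players/_refresh"

# exact document count for the index
curl -XGET "localhost:9200/players/_count?pretty"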

Here's my config ... didn't see an option to attach a file.

# The # character at the beginning of a line indicates a comment. Use
# comments to describe your configuration.

##################################
# INPUT
##################################
input {
  file {
    path => "/var/data/hardball/players/baseballPersons.txt"
    type => "baseballPerson"
    sincedb_path => "/dev/null"
    start_position => "beginning"
  }
}


##################################
# FILTERS
##################################
filter {

  # Indexing the Retrosheet player data file with the following fields:
  # Fields: LAST,FIRST,ID,DEBUT
  #  LAST
  #  FIRST
  #  ID
  #  DEBUT

  if [type] == "baseballPerson" {
    grok {
      patterns_dir => "/var/logstash/patterns"
      match => ["message", "%{PERSONNAME:lastName},%{PERSONNAME:firstName},%{BASEBALLPERSONID:baseballPersonID},%{DATE_US:dateDebut}"]
      add_field => { "baseballPersonType" => "%{baseballPersonID}" }
    }

    # grok {
      # patterns_dir => "/var/logstash/patterns"
      # match => ["message", "%{PERSONNAME},%{PERSONNAME},([a-z\-]{4})([a-z])%{SINGLEDIGIT:baseballPersonType}([0-9]{2}),%{DATE_US}"]
    # }

    mutate {
      gsub => [
        # extract the first digit, which indicates the type of baseball person
        "baseballPersonType", "([a-z\-]{4})([a-z])([0-9])([0-9]{2})", "\3"
      ]
      remove_field => [ "host", "path" ]
    }
  }

}


##################################
# OUTPUT
##################################
output {
  elasticsearch {
    hosts => ["localhost"]
    index => "players"
    document_id => "%{baseballPersonID}"
  }
  # stdout { codec => json }
}

As you are explicitly setting the ID of the documents, is it possible that your file contains duplicates? Do you have any records that for some reason have failed parsing and have been indexed with the literal string "%{baseballPersonID}" as the document ID?
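
A quick way to check the duplicate angle from the shell, assuming the ID is the third comma-separated field as in your grok pattern:

# total lines vs. unique IDs in the source file; a difference suggests duplicate IDs
wc -l < /var/data/hardball/players/baseballPersons.txt
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort -u | wc -l

# list any IDs that appear more than once
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort | uniq -d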

I'll check, Christian. Would any of this be reported in the Logstash log so I could grep for it?

Not likely, no.

Thanks again, Christian. I went back and checked the IDs that made it to ES against a full list of IDs that should have made it, and found that I had some parsing failures in my grok filter.

Basically, I used Linux commands to produce a list of IDs from my source file and the following command to produce a list of IDs from ES.

curl -XGET "localhost:9200/players/_search?pretty&from=0&size=20000" | grep '"_id"' | cut -d '"' -f 4 | sort -u > ES-IDs.txt

I compared these lists to see which were missing from ES and then went back to the source file to find out why a particular line had failed.
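
Something along these lines (source-IDs.txt is just a placeholder name):

# extract the ID field (third comma-separated column) from the source file
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort -u > source-IDs.txt

# IDs present in the source file but missing from Elasticsearch
comm -23 source-IDs.txt ES-IDs.txt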

Is there a better way to debug these grok parse failure issues?

You could add a separate output, e.g. to a daily file, and write any records tagged with _grokparsefailure there. That would let you easily identify the records with issues.
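
For example, something like this (the failure log path is only an illustration):

output {
  if "_grokparsefailure" in [tags] {
    # write events that failed grok parsing to a daily file for later review
    file {
      path => "/var/log/logstash/grok-failures-%{+YYYY-MM-dd}.log"
      codec => json_lines
    }
  } else {
    elasticsearch {
      hosts => ["localhost"]
      index => "players"
      document_id => "%{baseballPersonID}"
    }
  }
}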


Thanks. I now see that I can grep for "_grokparsefailure" in my log file and identify which lines from my source file weren't parsed correctly. That's a big help!
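
For anyone else who finds this thread, that boils down to something like:

# count the events that failed to parse
grep -c '_grokparsefailure' hardball.log

# show the matching debug entries for closer inspection
grep '_grokparsefailure' hardball.log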
