Determining why records are missing after Logstash indexing

Hi - I'm attempting to index a single file with 19,690 lines; each line will be its own record in ES. The lines are pretty small, only 4-5 fields per line.

After I run Logstash using the following command:

sudo bin/logstash -f config/hardball.conf --debug -l hardball.log

I see that my Elasticsearch index only contains 19,502 records.

I've tried looking in hardball.log but it's 200 MB.

Any recommendations on what to search for in the log file to see what might have happened to the missing 188 records? Any other debugging tips with Logstash?


What does your logstash config look like? Did you wait for Elasticsearch to refresh before counting the documents?
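
For the second question, a quick way to rule it out, assuming Elasticsearch is on localhost:9200 and the index is named players, is to force a refresh and then ask for an exact count:

# make all indexed documents visible to search
curl -XPOST "localhost:9200/players/_refresh"

# exact document count for the index
curl -XGET "localhost:9200/players/_count?pretty"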

Here's my config ... didn't see an option to attach a file.

# The # character at the beginning of a line indicates a comment. Use
# comments to describe your configuration.

##################################
# INPUT
##################################
input {
  file {
    path => "/var/data/hardball/players/baseballPersons.txt"
    type => "baseballPerson"
    sincedb_path => "/dev/null"
    start_position => "beginning"
  }
}


##################################
# FILTERS
##################################
filter {

  # Indexing the Retrosheet player data file with the following fields:
  # Fields: LAST,FIRST,ID,DEBUT
  #  LAST
  #  FIRST
  #  ID
  #  DEBUT

  if [type] == "baseballPerson" {
    grok {
      patterns_dir => "/var/logstash/patterns"
      match => ["message", "%{PERSONNAME:lastName},%{PERSONNAME:firstName},%{BASEBALLPERSONID:baseballPersonID},%{DATE_US:dateDebut}"]
      add_field => { "baseballPersonType" => "%{baseballPersonID}" }
    }

    # grok {
      # patterns_dir => "/var/logstash/patterns"
      # match => ["message", "%{PERSONNAME},%{PERSONNAME},([a-z\-]{4})([a-z])%{SINGLEDIGIT:baseballPersonType}([0-9]{2}),%{DATE_US}"]
    # }

    mutate {
      gsub => [
        # extract the first digit, which indicates the type of baseball person
        "baseballPersonType", "([a-z\-]{4})([a-z])([0-9])([0-9]{2})", "\3"
      ]
      remove_field => [ "host", "path" ]
    }
  }

}


##################################
# OUTPUT
##################################
output {
  elasticsearch {
    hosts => ["localhost"]
    index => "players"
    document_id => "%{baseballPersonID}"
  }
  # stdout { codec => json }
}

As you are explicitly setting the ID of the documents, is it possible that your file contains duplicates? Do you have any records that for some reason have failed parsing and have been indexed with the literal string "%{baseballPersonID}" as the document ID?
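
A quick way to check the duplicate angle from the shell, assuming the ID is the third comma-separated field as in your grok pattern:

# total lines vs. unique IDs in the source file; a difference suggests duplicate IDs
wc -l < /var/data/hardball/players/baseballPersons.txt
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort -u | wc -l

# list any IDs that appear more than once
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort | uniq -d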

I'll check, Christian. Would any of this be reported in the Logstash log so I could grep for it?

Not likely, no.

Thanks again, Christian. I went back and checked the IDs that made it to ES against a full list of IDs that should have made it, and found that I had some parsing failures in my grok filter.

Basically, I used Linux commands to produce a list of IDs from my source file and the following command to produce a list of IDs from ES.

curl -XGET "localhost:9200/players/_search?pretty&from=0&size=20000" | grep '"_id"' | cut -d '"' -f 4 | sort -u > ES-IDs.txt

I compared these lists to see which were missing from ES and then went back to the source file to find out why a particular line had failed.
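
Something along these lines (source-IDs.txt is just a placeholder name):

# extract the ID field (third comma-separated column) from the source file
cut -d ',' -f 3 /var/data/hardball/players/baseballPersons.txt | sort -u > source-IDs.txt

# IDs present in the source file but missing from Elasticsearch
comm -23 source-IDs.txt ES-IDs.txt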

Is there a better way to debug these grok parse failure issues?

You could add a separate output, e.g. to a daily file, and write any records tagged with _grokparsefailure there. That would let you easily identify the records with issues.
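
For example, something like this (the failure log path is only an illustration):

output {
  if "_grokparsefailure" in [tags] {
    # write events that failed grok parsing to a daily file for later review
    file {
      path => "/var/log/logstash/grok-failures-%{+YYYY-MM-dd}.log"
      codec => json_lines
    }
  } else {
    elasticsearch {
      hosts => ["localhost"]
      index => "players"
      document_id => "%{baseballPersonID}"
    }
  }
}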


Thanks. I now see that I can grep for "_grokparsefailure" in my log file and identify which lines from my source file weren't parsed correctly. That's a big help!
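
For anyone else who finds this thread, that boils down to something like:

# count the events that failed to parse
grep -c '_grokparsefailure' hardball.log

# show the matching debug entries for closer inspection
grep '_grokparsefailure' hardball.log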
