Missing CSV records in Elasticsearch


(Xavier Naud) #1

I am importing a 120 MB CSV file that contains 27,858 US cities with their polygons into Amazon Elasticsearch, using Logstash 5.6.2.
I run Logstash with the stdin input rather than the file input so that it stops at the end of the data.
After the import, the index contains only around 26,400 cities. The rest are missing, and there is no error in the output or in any log.

Here are some facts:

  • The file has one GeoJSON field for the geo shapes, and all of them are well formed
  • All document ids are unique (and I get no deleted docs in the index).

I tried using a persistent queue and setting queue.drain: true in the settings file, but it does not change anything.
The process ends without error, but I am still missing some records.

The settings file:
queue.type: persisted
queue.drain: true

The configuration file:

input {
    stdin {
    }
}
filter {
    csv {
        separator => "|"
        quote_char => "@"
        columns => ["gid", "community_id", "hj_id", "name", "state", "lon", "lat", "geo"]
        add_field => {
            "location" => "%{lat},%{lon}"
            "suggest" => "%{name}, %{state}"
        }
        remove_field => ["lat", "lon", "host", "message", "@version"]
    }
    mutate {
        convert => {
            "gid" => "integer"
            "community_id" => "integer"
        }
    }
    json {
        source => "geo"
        target => "geo"
    }
}
output {
    stdout { codec => rubydebug }
    amazon_es {
        hosts => ["XXXX.es.amazonaws.com"]
        region => "XXX"
        aws_access_key_id => 'XXXX'
        aws_secret_access_key => 'XXXX'
        index => "cities-%{+YYYYMMdd}"
        template => "./cities-template.json"
        template_name => "cities"
        document_type => "city"
        document_id => "%{hj_id}"
    }
}

Where should I look next?
How can I check what is in the persistent queue?
Do I need to restart Logstash in a certain way to process the records left in the queue?
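
One way to narrow this down is to work out exactly which ids never reached the index. The following is a minimal Python sketch, not part of the original setup. It assumes the CSV can be parsed with the same "|" separator and "@" quote character as the csv filter above, and that the ids present in the index have already been exported to a list by some other means. The function names read_ids and missing_ids are hypothetical, and Python's CSV parser may treat unusual quoting differently from Logstash's.

```python
import csv

# Column order copied from the csv filter in the Logstash configuration.
COLUMNS = ["gid", "community_id", "hj_id", "name", "state", "lon", "lat", "geo"]


def read_ids(lines, id_column="hj_id"):
    """Return the id of every non-empty CSV row, in file order."""
    # The geo column holds whole polygons, so raise the per-field size limit
    # well above the default of 131072 characters.
    csv.field_size_limit(2**31 - 1)
    reader = csv.reader(lines, delimiter="|", quotechar="@")
    position = COLUMNS.index(id_column)
    return [row[position] for row in reader if row]


def missing_ids(csv_ids, indexed_ids):
    """Return the ids found in the CSV but absent from the index."""
    indexed = set(indexed_ids)
    return [doc_id for doc_id in csv_ids if doc_id not in indexed]
```

read_ids accepts any iterable of lines, so an open file object (opened with newline="") can be passed directly. The resulting list of missing ids gives a small set of records to re-send by hand.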

Thanks

Xavier


(Xavier Naud) #2

I isolated some of the records that do not show up and tried to index them manually.
Only then could I see the error:

"caused_by": {
    "type": "invalid_shape_exception",
    "reason": "Self-intersection at or near point (-75.21756305418718, 38.757002463054185, NaN)"
}

So the polygon is not valid.
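
A rough offline check for this kind of problem can be sketched in Python. This is only an approximation under stated assumptions: each ring is a closed list of [lon, lat] pairs as in GeoJSON, only edges that properly cross each other are detected (touching or overlapping edges are ignored), and the pairwise comparison is quadratic in the number of points. It does not reproduce Elasticsearch's own shape validation, and the function names are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1 or 0."""
    value = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (value > 0) - (value < 0)


def _segments_cross(p1, p2, p3, p4):
    """True when segments p1-p2 and p3-p4 cross at a single interior point."""
    return (
        _orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
        and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0
    )


def ring_self_intersects(ring):
    """Check a closed ring (last point equals first) for crossing edges."""
    count = len(ring) - 1  # number of edges in a closed ring
    edges = [(ring[i], ring[i + 1]) for i in range(count)]
    for i in range(count):
        for j in range(i + 2, count):
            # The first and last edges share the closing point, so skip them.
            if i == 0 and j == count - 1:
                continue
            if _segments_cross(edges[i][0], edges[i][1], edges[j][0], edges[j][1]):
                return True
    return False
```

Running this over every ring of every polygon in the CSV would give a candidate list of bad shapes, but a shape that passes here can still be rejected by Elasticsearch.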

Is there a way to get this error written to a log file so that I can build a list of the invalid polygons?
Do I need to enable some log4j logger?
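
Independently of Logstash logging, the rejected documents can be listed from a bulk response. The sketch below assumes the records are sent to the _bulk endpoint by hand and the JSON response is saved and parsed into a dict. It also assumes the response layout used by Elasticsearch 5.x: a top-level errors flag, an items array, and a per-item error object that may carry a caused_by entry like the one shown above. The function name failed_items is hypothetical.

```python
def failed_items(bulk_response):
    """Return one dict per rejected action in a parsed bulk response."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        for result in item.values():
            error = result.get("error")
            if error is None:
                continue
            # Prefer the nested cause, which names the actual shape problem.
            cause = error.get("caused_by", {})
            failures.append({
                "id": result.get("_id"),
                "status": result.get("status"),
                "type": cause.get("type", error.get("type")),
                "reason": cause.get("reason", error.get("reason")),
            })
    return failures
```

Writing the returned list to a file gives the ids of the invalid polygons together with the reason Elasticsearch gave for each one.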


(Mark Walkom) #3

Elasticsearch should respond with an error back to the client, is that happening?


(Xavier Naud) #4

Elasticsearch probably responds with an error, as it does when I send the request manually, but Logstash does not report anything.

