Missing CSV records in Elasticsearch

(Xavier Naud) #1

I am importing a 120MB CSV file that contains 27,858 US cities with their polygons into Amazon Elasticsearch with Logstash 5.6.2.
I run Logstash with the stdin input rather than a file input, so that it stops at the end of the data.
After the import, I only have around 26,400 cities. The rest are missing, with no error in the output or in any logs.

Here are some facts:

  • The file has one GeoJSON field for the geo shapes, and every value is well-formed JSON
  • All document ids are unique (and I get no deleted docs in the index).
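
The two facts above can be re-checked outside Logstash. The following is a minimal sketch, assuming the same separator, quote character and column order as the csv filter in the configuration; the function name check_rows is made up for illustration. It counts rows, collects repeated gid values, and collects the gid of any row whose geo field does not parse as JSON. It says nothing about whether a polygon is geometrically valid.

```python
import csv
import json

# Column order copied from the csv filter in the Logstash configuration.
COLUMNS = ["gid", "community_id", "hj_id", "name", "state", "lon", "lat", "geo"]

# Polygon fields can be far larger than the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def check_rows(lines):
    """Return (row_count, duplicate_gids, bad_json_gids) for an iterable of CSV lines."""
    seen = set()
    duplicates = []
    bad_json = []
    count = 0
    # Same separator and quote character as the csv filter.
    for row in csv.reader(lines, delimiter="|", quotechar="@"):
        count += 1
        record = dict(zip(COLUMNS, row))
        gid = record.get("gid")
        if gid in seen:
            duplicates.append(gid)
        seen.add(gid)
        try:
            json.loads(record.get("geo", ""))
        except (TypeError, ValueError):
            # Missing column or a value that is not parseable JSON.
            bad_json.append(gid)
    return count, duplicates, bad_json
```

Passing an open file object (opened with newline="") or sys.stdin as lines should work, since csv.reader accepts any iterable of strings. If the row count already differs from 27,858, the records are being lost before they reach Elasticsearch.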

I tried to use a persistent queue and set queue.drain: true in the settings file, but it does not change anything.
The process ends without error, but I am still missing some records.

The settings file:
queue.type: persisted
queue.drain: true

The configuration file:

input {
    stdin {
    }
}
filter {
    csv {
        separator => "|"
        quote_char => "@"
        columns => ["gid", "community_id", "hj_id", "name", "state", "lon", "lat", "geo"]
        add_field => {
            "location" => "%{lat},%{lon}"
            "suggest" => "%{name}, %{state}"
        }
        remove_field => ["lat", "lon", "host", "message", "@version"]
    }
    mutate {
        convert => { "gid" => "integer" }
        convert => { "community_id" => "integer" }
    }
    json {
        source => "geo"
        target => "geo"
    }
}
output {
    stdout { codec => rubydebug }
    amazon_es {
        hosts => ["XXXX.es.amazonaws.com"]
        region => "XXX"
        aws_access_key_id => 'XXXX'
        aws_secret_access_key => 'XXXX'
        index => "cities-%{+YYYYMMdd}"
        template => "./cities-template.json"
        template_name => "cities"
        document_type => "city"
    }
}

Where should I look next?
How can I check what is in the persistent queue?
Do I need to restart Logstash in a certain way to process records from the queue?



(Xavier Naud) #2

I isolated some of the records that don't show up and tried to index them manually.
Only then can I see the error:

"caused_by": {
    "type": "invalid_shape_exception",
    "reason": "Self-intersection at or near point (-75.21756305418718, 38.757002463054185, NaN)"
}

So the polygon is not valid.

Is there a way to get this error into a log file so that I can get a list of the invalid polygons?
Do I need to enable some log4j logger?
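
Independently of any logging, the polygons could be pre-screened before import. The following is a rough sketch, not the validation Elasticsearch itself performs: it flags a ring when two of its non-adjacent edges properly cross, using a plain orientation test. All function names are hypothetical, and the check is quadratic in the number of points per ring, so it may be slow on very detailed shapes. Rings that only touch themselves at a single point are not reported.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    value = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (value > 0) - (value < 0)


def _segments_cross(p1, p2, p3, p4):
    """True when segments p1-p2 and p3-p4 cross at a single interior point."""
    return (_orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
            and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0)


def ring_self_intersects(ring):
    """Check a closed GeoJSON linear ring ([[lon, lat], ...]) for crossing edges."""
    edges = list(zip(ring, ring[1:]))
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            # Edges that share an endpoint give a zero orientation and are skipped.
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False


def polygon_self_intersects(geometry):
    """Apply the ring check to every ring of a GeoJSON Polygon or MultiPolygon."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        # Other geometry types are not examined by this sketch.
        return False
    return any(ring_self_intersects(ring) for rings in polygons for ring in rings)
```

Running polygon_self_intersects over the parsed geo field of each row and printing the gid of every hit would give a candidate list to compare against the records that are missing from the index.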

(Mark Walkom) #3

Elasticsearch should respond with an error back to the client, is that happening?

(Xavier Naud) #4

Elasticsearch probably responds with an error, at least it does when I send the request manually, but Logstash does not report anything.
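
For reference, when documents are sent through the _bulk API, the response carries a top-level "errors" flag and one entry per action under "items", and a rejected action has an "error" object in its entry. The following is a minimal sketch of pulling the rejected documents out of such a response once it has been parsed into a dict. The function name failed_items is hypothetical, and how to obtain the response from the amazon_es output is not covered here.

```python
def failed_items(bulk_response):
    """Return [(doc_id, error_type, reason)] for every rejected action in a _bulk response."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action: index, create, update or delete.
        for result in item.values():
            error = result.get("error")
            if error is None:
                continue
            if not isinstance(error, dict):
                # Some versions report the error as a plain string.
                failures.append((result.get("_id"), None, str(error)))
                continue
            # Walk down to the innermost cause, which is where the shape error appeared above.
            cause = error
            while isinstance(cause.get("caused_by"), dict):
                cause = cause["caused_by"]
            failures.append((result.get("_id"), cause.get("type"), cause.get("reason")))
    return failures
```

Applied to the response for a batch containing one of the isolated records, this should return that record's id together with invalid_shape_exception and the self-intersection message.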
